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Abstract 

It is attractive to implement a large scale parallel processing system as a self-timed hardware 
system with decentralized control and to improve maintainability and availability in such a system 
through fault tolerance. In this thesis we show how to tolerate hardware failures in a self-timed 
hardware system with a packet communication architecture, designed to execute parallel programs 
organized by data flow concepts. 

We first formulate a design methodology for incorporating redundant hardware into self-timed 
systems for fault tolerance. Redundancy management problems in self-timed systems arc illustrated 
with a byte-sliced hardware module structure. Robust algorithms are given for synchronizing byte 
slices in a redundant module so mat their outputs can be decoded to detect and/or mask hardware 
failures. Hardware implementation of these redundancy management algorithms is studied under a 
stuck -at fault model, a random pulse train fault model and a random wave train fault model 

In studying the design of fault-tolerant data flow processors we have also developed a dynamic 
redundancy scheme for masking hardware failures in a multiprocessor architecture designed to 
execute parallel programs organized by data flow principles. Novel features of mis architecture 
include use of packet networks to support communication aomg processing elements and dynamic 
allocation of a homogeneous set of functional units to service requests. Program organization and 
hardware module designs to support the dynamic redundancy scheme are described. 
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1. Introduction 

In this thesis we study the following fault tolerance problem: 

"How can hardware failures be tolerated in a self-timed hardware system which is 
organized by a packet communication architecture and designed to execute data flow 
programs?" 

We first explain our motivations for studying fault tolerance problems in mis context 

Computer systems which are significantly more powerful than those presently available are 
needed in weather forecasting and aeronautical design. These physics problems are often formulated 
as a set of difference equations for numerical solution. It is readily apparent from studying these 
equations mat many intermediate values required for their solution can be computed in parallel; and 
so it is attractive to design computer systems which can exploit such parallelism through hardware 
concurrency. Many approaches to organize a large number of hardware units so that a good portion 
of them can participate in the computation in progress simultaneously have been studied. 
Architectures based on vector processing and array processing are available commercially. Another 
approach which appears to be very promising, and provides the environment in which the research 
reported in this thesis is carried out, is to construct computer systems based on data flow concepts. 
Current projects on data-driven computations and highly concurrent systems based on these concepts 
are surveyed in [501 [51]. 

The data flow approach to parallel processing is based on the observation that an operation is 
enabled for execution as soon as the operands it needs are available, and hence need not be further 
delayed by other artificial sequencing constraints. In die internal representation of an application 
program in a data flow system, data dependencies among operations are stored explicitly and used 



directly to schedule operations for execution. In the data flow computer architectures [24], [25] 
studied at MIT, application programs are stored as program graphs, and hardware units are designed 
to execute program graphs directly. 

In a conventional computer organization, operation of hardware units such as arithmetic-logic 
units, shifters and register files are coordinated in machine instruction execution by control signals 
generated by a control unit Many mainframe computers, minicomputers and microcomputers are 
implemented under this control strategy. We call this mode of hardware operation centralized 
control. To gain orders of magnitude improvement in performance over state-of-the-art 
uniprocessors, a parallel processing system must incorporate a large number of hardware units and 
exploit parallelism in computations to achieve high throughput It seems mat to operate a large 
number of hardware units under centralized control, excessive demand will be placed on the 
complexity and performance of the control unit Performance is compromised if the control unit 
becomes a system bottleneck due to the complexity of me control tasks it must carry out On the 
other hand, only restricted forms of parallelism, such as pipelining and array processing, can be 
realized as simple control tasks. For a large scale architecture it is thus attractive to use decentralized 
control. Packet communication architecture is an organizational scheme to support decentralized 
control 

Under the packet communication principle hardware units in a system communicate with each 
other by exchanging information packets, and packet processing is carried out at each unit 
autonomously. The size and complexity of mese units depend on bom system design and 
implementation technology. In an architecture designed for high speed numerical computation, for 
example, each unit may consist of a small number of VLSI chips and a packet may consist of a pair of 
floating point numbers and some control information. 
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The packet communication discipline has been used to organize several data flow processors 
[24], [26] and a high throughput memory system [21]. We have also studied design methodologies for 
constructing hardware systems organized by a packet communication architecture, which we shall 
refer to as packet communication systems [33]. Formal methods for studying properties of packet 
communication systems have been studied fa [27] and f32J. in this thesis we study architectural 
organizations appropriate for tolerating hardware failures fa packet communication computer 
architectures, and will illustrate our ideas with a packet communication architecture designed to 
execute a class of data flow programs. As part of mis study we have also developed a set of concepts 
and techniques for incorporating redundant hardware into self-timed hardware systems [48] to combat 
hardware failures. 

In a conventional computer organization tow level hardware synchronization, between a 
combinational circuit and its input and output registers, for example, is governed by a set of periodic 
signals generated by a system clock. Due to the use of a system dock, these hardware systems are also 
called synchronous hardware systems. A decentralized control computer architecture can be 
implemented as a synchronous system. A large scale synchronous system, however, suffers from poor 
modularity because synchronous operation introduces strong interaction among hardware modules, 
indirectly through die timing characteristics of the clock signals. The design of each hardware 
module must be consistent with the period and frequency of the clock signals, even though these 
characteristics may have been imposed by the synchronization requirements of other hardware 
modules. When two neighboring hardware modules synchronize their input, processing and output 
activities indirectly by referencing a common timing signal and one of these modules is modified, the 
modification must not violate the timing assumptions made in Htftie other module. Otherwise the other 
module must be modified accordingly. There are also skew problems associated with clock 
distribution in a large scale system. For these reasons we also favor implementing packet 
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communication system as an interconnection of hardware modules which exchange packets under 
asynchronous two-way handshake protocols. 

Under an asynchronous packet communication protocol, packet communication between 
modules is synchronized entirely through exchanging signals among them. We call hardware 
modules which interact with their environment via these protocols packet communication modules. 
An example of asynchronous packet communication is given in Fig. 1.1. Packets are delivered as a 
sequence of packet bytes over an 8-bit data link (Fig. 1.1a). Packet byte receipt and transmission are 
synchronized by exchanging control signafc between the sender and the receiver over the control link. 
Packet communication using the control discipline of transition signaling is illustrated in Fig. 1.1b. 

Since a packet communication system makes use of handshake protocols for sending packets 
between modules, it is also a self-timed hardware system, the redundancy management concepts 
and techniques developed in mis thesis have been tailored specifically for packet systems, but they 
are also applicable to other classes of self-timed hardware systems KJ. Pll {39J, f49J. 

Let us consider designing high performance computing systems for physics simulation such as 
in weather forecasting and aeronautical design. The physics simulation environment is different from 
traditional application domains for fault-tolerant computers, such as process control and 
interplanetary spaceflight, in mat it inherently has no stringent reliability requirements. It is 
nonetheless attractive to use fault tolerance techniques to improve availability and maintainability. 
System throughput is improved if hardware failures can be masked, especially since it is not unusual 
for a numerical computation in physics simulation to execute for many hours. It is extremely 
annoying if the entire computing system is disabled because of one malfunctioning transistor or one 
loose connection among the hundreds of thousands in use. These considerations have motivated the 
research reported in this thesis on fault tolerance in packet communication computer architectures. 
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The problems studied in this thesis will be explained in more detail after we introduce some 
fundamental ideas for designing computing systems which can tolerate hardware failures. 
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1.1 Fault Tolerance 

1.1.1 Basic Concepts 

The following paragraphs on fault tolerance for physical faults is excerpted from the paper 
"Fault-tolerance: The survival attribute of digital systems" by Professor Algirdas Xvizienis of UCLA, 
published in die Proceedings of the IEEE, vol. 66,nb.ld.f53} 

"The notion of fault tolerance for physical faults as an attribute of digital systems requires the 
viewing of the system from two different viewpoints: physical and logical Bram the viewpoint of 
purely physical observation and measurement* digital systems are physical systems, built of large 
assemblies of electronic, electromechanical and odier physical components. The components obey 
the laws of physics and the behavior of me system can be described in terms of time functions of the 
values ofphysical variables: voltages, currents, positions, etc. 

"From the viewpoint of die human society at large digital systems are "Mack boxes' that 
perform information processing to meet human needs. Indie very general view of Newell and Simon 
[401 they serve as physical symbol systems, in which physical variablesr represent other entities - 
symbols, expressions, and processes, and die system is capable of designation and interpretation. At 
die level of greatest formal detail die information processing behavior of a digital system may be 
specified in terms of a finite-state sequential machine. Outside of the formal interpretations, the dafiy 
user sees his box of digital hardware as a useful artifact that can store lame quantities of symbolic 
information and can carry out information processing operations that are specified by his programs. 
In this perspective, it is only the logical activity of information processing that concerns die user, and 
not the physical properties of die circuits inside. As long as die logic machine runs well what 
happens with die physical system that is its host is of very littie, if any, interest 
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'Thc use of the physical system as a bgic machine is based on me convtntiowAat the values 
of physical variables are interpreted as the discrete values of logic variables, and that the speed with 
which transformations are carried out is limited by the physical properties of the hardware. The logic 
machine behaves in the specified manner «S long «s foe parameters of physical components and the 
speed of operation remain within specified limit*. However, it has been acojrimon experience that 
unexpected out-of-specification physical changes in component parameters do occur in all kinds of 
digital hardware. They are usually called malfunctions when the changes are temporary, and failures 
when the changes are permanent Their effect is to cause an unspecified and disruptive change of 
one or more logic variables of the logic machine. Mi a change of logic values is called a physical 
fault, or simply a fault when the physical cause of toe fault is dear from the context of the discussion, 
as it is here.... 

"It is the possibility of randomly occurring fault* mat makes the user uncomfortably aware of 
the physical side of his logic machine. The occurrence of a fault often causes an error, that is, a 
deviation of the logic machine from its program-specified behavior (transition through a sequence of 
specified states) into i sequence of error states. While in an error state, me logic machine fails to 
perform correctly at least one of in specified tasks, and suffers a partial or complete failure to carry 
out the information processing function. Such failures are •commonplace experience for the users of 
amteniporary information processing systems. The usual solution of the problem fe a manually 
controlled maintenance action that results in the removal or repair of the cause of the fault The 
system is then restarted to run until the next fault strike*. 

"The purpose of feult-tolerance is to offer an alternate solution to the fault problem in which 
the detection of faults and the recovery to normal operation are carried out a* internal functions of 
the system itself. Fault-tolerance is the unique attribute of a digital system which makes it possible 
for the system to continue with its program-specified behavior as a logic machine after the occurrence 



13 



of faults. It may be said that fault-tolerance b die survival attribute of die logic machine because its 
purpose is to cause a return from error stales back to the specified behavior, thus assuring the survival 
of the information processing activities. 

" ... fault-tolerance requires additional hardware and/or software that is redundant during 
normal operation and would be entirely supeifluous in a completely fault-free logic machine. In me 
fault-prone physical implementation, fault-tolerance is me insurance of die logic machine against 
disruptive physical events. Its function is to react to the occurrence of faults and to protect die logic 
machine against the imperfections of the physical system that serves as its host ..." 

The? above excerpt serves to introduce the bask: notions of failures, faults, errors and fault 
tolerance for fault-tolerant computmg. We use me teMts failures, faults and errors as explained 
above. A failure is an unexpected out-of-specification physical change in component parameters. 
Examples of failures are out-dNpec transistors and loose connections. A fault is an unspecified and 
disruptive change of logic values caused 1$ a failure. A failure occurs in a physical machine while a 
fault occurs m a logical machine. A fault model relates die two phenomena by postulating how logic 
values may change when physical famiresxxxur. The most common fault model is the stuck-at fault 
model which postulates that logic variables will become stuck at either logical 1 or logical as a result 
of physical failures. An error is a deviation of the logic machine from its program-specified behavior 
into a sequence of error states due to faults. Faults may be tolerated such that their occurrences do 
not cause errors. A simple example of fault tolerance is to use bit-sliced memory systems protected 
by error correcting codes so that when a single bit-slice fails and causes one bit of die data word to be 
faulty, the contents of die word can still be rec»vered and no error can occur. 

We shall restrict our attention to tolerating physical failures in this thesis, and will not deal with 
human mistakes that occur when the computing system is designed, implemented or in use. 
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Fault-tolerant systems differ in the protections each offers against physical failures. For system 
design the related notions of maintainability and availability are also useful. Maintainability is the 
property of having low maintenance costs. Availability is a measure of die fraction of the useful life 
of a computing system during which (he system is available for performing die functions it is 
designed for. Communication processors arc often designed to be highly available, even though 
there is no guarantee that conversations in progress when failures occui; wffl not be disrupted. 

The logical machine we work with in this thesis is a computer system designed to execute a class 
of data flow programs. The underlying physical machine is a hardware system organized by a packet 
communication architecture. We study architectural organizations and hardware redundancy 
techniques appropriate for tolerating certain types of hardware failures in the physical machine. Our 
design goal is that a data Mow program running on the computer system when failures in die specified 
class occur can always run to completion, and compute the same, results as if no failure has occurred. 
In other words, the program execution capabilities of die logical machine are preserved as long as 
only failures in die specified class occur in die physical machine. The system design presented in this 
thesis achieves fault tolerance by integrating program organization, architectural features and 
redundant hardware in the physical machine into a system strategy. This strategy is effective only for 
certain types of hardware failures in die physical machine. Hardware failures in a physical machine 
are characterized by their extent, as well as by fault models. We will study fault models for packet 
communication systems in detail. For now we state our assumptions about the extent of hardware 
failures that can be tolerated by our strategy. Since die extent of failures can only be described in 
terms of physical structure, we next introduce die structure of redundant hardware modules used in 
this thesis. 
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1.1.2 Structure of Redundant Packet 

We propose a design methodology for fault-tolerant packet communication systems consisting 
of the following steps: 

(1) A packet communication system is designed according to Ms ftinctional specification. We refer to 
this initial design as the non-redundant system. 

(2) Fault tolerance is incorporated into the non-redundant system by incorporating redundant 
hardware into each module in die non-redundant system and into the connections between 
modules. Hie product of mis second step is a redundant system topokgically similar to die 
non-redundant system, but constructed out of redundant modules connected through redundant 
links. 

For this diesis, we furthermore assume that redundant packet communication modules have a 
byte-sliced internal organization, constructed with failure-independent fete j/kks. 

Byte slicing is a common technique for implementing memory systems and arithmetic-logic 
processors. A memory chip or an arithmetic-logic chip implements the specified function for fixed 
size bytes. An array of such Chios » used to Hnpkroent In 

general, each byte slice should receive all input words, aod control and, data «rc passed between slices 
in the same array, (In a memory system or an aritometic-kific pnw«or each byte-slice receives only 
one byte of each input word, and ia a memory 4|s|ijpi|b^^|(^^ 
chips.) An output wordof abyte-sUc^jn od u l^f tJ fl^ *»c h s l i ne. 

For fault tolerance, redundant slices are introduced into a byte-sliced module and the bytes in 
each output word are correlated by an encoding scheme. In an encoding scheme a hardware module 
is modeled as a function which maps an input bit vector into an output bit vector, and a module 
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failure is modeled as a fault junction which maps a fault-free output bit vector into a faulty one. If 
each output bit vector consists of n bits, men an encoding scheme C partitions the set of all n-bit 
vectors into a code space Sq and a non-code space Nq. To detect hardware failures in a module, C is 
chosen such that: 

(1) All output vectors generated by the module during fault-free operation are in Sq, and 

(2) Under the most common hardware failure modes in the module no faulty bit vector in Sq can be 
generated. This property implies that under die most common failure modes the outputs 
generated by the module is either the fault-free result or is a non-codeword in Np. 

When the output words of a module are encoded by a scheme C satisfying these two properties, 
the most likely hardware failures in the module can be detected by checking these words for code 
space violations. An encoding scheme can be used to mask failures in a hardware module if it 
satisfies (1), (2) and (3): 

(3) Under the most common hardware failure modes in the module, each faulty output vector in N c 
can be decoded to derive die fault-free vector. 

The byte-sliced organization is particularly amenable to applying encoding techniques since 
under some of the most common failure modes, failure of a single byte sBce, for example, only one 
byte in each output word will be affected. Since the pioneering work of Hamming PH much wort 
has been done on applying coding techniques to detect and mask failures in byte-sHced arithmetic 
processors and memory systems f61, [13}, $51 ftff- Hardware failures are also often dealt with using 
modular redundancy. In a modular redundancy scheme several copies of a hardware module are 
operated together and their outputs are compared or voted upon to detect and mask hardware 
failures among them. Modular redundancy can be regarded as a special case of byte slicing and 
output encoding in which all slices perform identical functions and each output word consists of two 
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or more identical bytes. 

Each byte-sliced module M in a redundant packet communication module is designed so that 
failures among byte slices in it can be detected and/or masked by decoding their outputs, as long as 
the extent of hardware failures is limited to less than some fixed number, M^, of byte slices in the 
module. 

A redundant packet communication system constructed using these byte-sliced modules can 
only tolerate hardware failures whose extents are thus limited. It is a reliability improvement over its 
nonredundant counterpart, however, only if 

(i) All the redundancy management mechanisms are designed oefltetry and operating reliably, and 
(ii) In the implementation technology, the probability that more man N^ slices have failed in a 

redundant module M is less than the probability that the nonredundant counterpart of M has 

fiakd. 

We have made these reliability assumptions in carrying out the investigation reported in mis thesis. 

1.2 freMe»SUfte*atf 

There are two major results reported in this mesh: a conceptual design for a fault-tolerant 
packet communication computer architecture, and a set of concepts and techniques for managing 
redundancy in setf-timed haidware systems. T£e architecture design fmb^en tailmetfsptt^ftcaBy for 
data flow computer organizations and the fault tolerance features are built directly on program 
execution mechanisms at the machine level. The hardware modules in this architecture include 
processing elements, which provide storage for the program in execution and simple functional 
capabilities, specialized functional units for performing more complex operations, routers for 
constructing routing networks, and allocators for constructing allocation networks. The properties of 
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these modules are explained is more detail in Chapter 6. For now we note that the redundancy 
concepts and techniques we have developed for self-timed hardware systems are applicable to 
tolerating hardware failures in the class of packet communication systems described in (33], including 
all these hardware modules used in the data flow computing system. Thus while the redundancy 
management concepts and techniques have been developed to provide implementation tools for 
realizing the fault-tolerant architecture as specified, they are in fact more generally applicable. 

Features of the data flow processor are described in Section 12.1. The need for introducing 
new concepts and techniques to manage redundancy in self-timed systems is illustrated in Section 
1.22. Some of the technical problems that arise in implementing our redundancy management 
schemes mhardwarcare discussed in Section 1JL3. ; n 

1.2.1 Design of a Fault-Tolerant Packet Comwunkatlon Computer Architecture 

Most computer systems [8J, [30], [58], [67] designed to tolerate hardware failures are intended 
for high reliability or long life applications with modest computational requirements. In this diesis, 
we study fault tolerance techniques to cope with hardware failures in a multiprocessor designed to 
execute programs expressed in a subset of the data flow language presented ta {2$. We snafl refer to 
this multiprocessor system as a data flow processor for convenience. This data flow processor has 
several novel features: 

Fault-tolerant multiprocessor systems [11], [30] have been designed using multiple, identical, 
hardware units. A program running on die data flow processor is partitioned and stored on a set of 
identical processing elements. The data flow processor also has a homogeneous set of specialized 
functional unite for performing complex operations. These functional units are allocated dynamically 
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to service requests from the processing elements. The, dynamic allocation scheme provides direct 
support for graceful degradation with respect to these functional units. Programs prepared for 
execution on die feuk-free data flow processor can run without modification if only a sunset of 
functional units has failed. 

2. Communication among processing elements and Junctional units is supported by packet networks. 

In the data flow processor hardware modules serve two distinct functions - processing and 
communication. Processing elements in a data flow processor execute subcomputations concurrently. 
Communication between the processing elements is supported by packet networks, to be constructed 
out of a few basic LSI cell types. This architecture is quite different from most fault-tolerant 
computer architectures reported in the literature, which are bus-oriented von Neumann architectures 
18, 59, 58] or bus-oriented multiprocessors (30, 671 Store-and-fbrward packet network designs which 
can handle a large number of packets concurrently have been analyzed in [12]. In some of these 
networks the number of bask modules and die length of connections between diem both exhibit 
faster than linear growth as the number of processing elements being serviced increases. It thus 
appears that a substantial amount of hardware in a practical data flow processor will be used to 
implement packet networks. The reliability of these ffctworks will be an important factor in assessing 
system reliability and availability, and it is important to minimize die amount pf redundant hardware 
invested in them to achieve a desired level of fault tolerance. 
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3. Hardware in the data flow processors organitedasapackei communication architecture. 

The data flow processor hardware is organized by a packet commmkajton discipline to support 
concurrency and modularity. Computer architectures are commonly implemented as synchronous 
digital systems in which events in all modules are synchronized with reference to a common timing 
signal. Many fundamental fault tolerance techniques have been developed in this context In 
contrast, a packet communication computer architecture is implemented as an interconnection of 
self-timed modules whose activities are synchronized through localized signal exchange, in 
accordance with me adopted packet communication protocol. Fauh tolerance techniques for 
asynchronous systems have been demonstrated previously in a fault-tolerant ckick design [16J and are 
reported in a paper on synchronization voting by Davis and Wakerly JI7J. We have generalized these 
techniques to show that byte slicing and coding techniques can be used to mask and detect failures in 
self-timed systems under more general fault assumptions than the commonly adopted stuck-at fault 
model 

In this thesis we present a fault-tolerant data flow processor design based on dynamic 
redundancy. In a dynamic redundancy scheme redundant hardware is incorporated to support 
on-line fault detection. Upon detecting a hardware failure, normal processing is suspended. The 
system is diagnosed and repaired (possibly degraded). Norma) processing is then resumed and 
subcomputations contaminated by failures are reexecuted. 

The dynamic redundancy scheme uses a combination of hardware-implemented fault detection 
and packet encoding techniques to mask packet network failures in the fault-tolerant data flow 
processor. It uses a retransmission strategy which is conceptually similar to error control techniques 
used in packet switched computer networks like die ARPANET. Differences in failure characteristics 
and performance requirements between these two types of networks have led to different 



21 



implementation strategies. We have also developed strategies for incorporating redundant hardware 
into a packet network to support rapid repair. 

In some computer systems hardware failures ate masked using module replication and majority 
voting. Another example of hardware-implemented fault masking is the use of error correcting codes 
in bit-sliced or byte-sliced memory systems. The hardware redundancy techniques presented in mis 
thesis can be used to incorporate redundant hardware to either mask or detect failures on-line in a 
self-timed hardware system. It is thus possible to design a fault-tolerant packet communication 
computer architecture based on hardware-implemented fault masking. The dynamic redundancy 
scheme we have developed offers the potential of considerable hardware savings, especially in the 
packet networks. Hardware-implemented fault masking offers two advantages over a dynamic 
redundancy scheme: the computation in progress need not be held up until recovery is completed, 
since hardware failures are masked on-line; and no additional programming effort is required to 
insert checkpoints into application programs. The first capability is important in meeting stringent 
response time requirements, in process control applications, for example, but is not essential for 
numerical computation. As we shall see, die dynamic redundancy scheme we have developed 
performs recovery at die machine instruction level, and hence has no impact on applications 
programming. 

When detailed logic designs and hardware failure rates are available for a hardware 
implementation, alternative schemes may be carefully evaluated to determine their cost-effectiveness. 
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1.2.2 Redundancy Management to Self-Timed Hardwire Systems 

In this thesis a data flow computing system is implemented hi a physical machine organized by 
a packet communication architecture. Our dynamic redundancy scheme must be supported by 
hardware-implemented fault detection. In Section 1.1 we have introduced a byte-sliced hardware 
structure for redundant packet communication modules. We wUl use this structure to incorporate 
redundant hardware for fault detection. Given a specification of the non-redundant module and the 
extent of physical failures to be detected, we can pick an encoding scheme and then design the byte 
slices accordingly. 

Let us consider how the notion of encoding the output words of byte-sliced modules is used to 
deal with byte slice failures in a synchronous hardware system under the stuck-at fault model. In a 
synchronous environment die byte slices in a module are synchronized with each other, and with 
rault handlers monitoring the outputs of die module, via a system clock. Under the stuck-at fault 
model the output lines of a failed module become stuck at either a logical or a logical 1. As long as 
the clock signals are generated in accordance with their timing specifications and module failures 
obey the stuck-at fault model, each signal presented to a fault handler by a byte-sliced module will 
have stabilized at a signal value representing either a logical or a logical 1, when the fault handler is 
activated. The outputs of a module can thus always be interpreted by a fault handler as a bit vector 
specified either by the bit vector function associated with the module or by die functional 
composition of this bit vector function with a fault function. Faults can then he detected or masked 
by designing the fault handler to operate on its input bit vectors based on the capabilities of the 
encoding scheme. 

Consider next the application of byte slicing and encoding schemes to deal with failures in 
redundant packet communication systems. Fust of all, there is die problem of modeling hardware 



-23- 



failures in packet communication modules, and how to design decoders, under these models, to 
decode inputs from fault-free sliees properly, in smte of m*erfewnee from failed slices. This is the 
problem of fault modeling and will be discussed in the next subsection. In this subsection we explain 
the problems of keeping fault-free slices in a redundant module properly synchronized and consistent 
so that their outputs can always be grouped together properly for decoding. We call these latter two 
problems redundancy management problems. 

The synchronization requirements can be illustrated by analyzing the operating principles 
behind the synchronous byte-sliced organization, commonly known as a triple modular redundancy 
scheme, shown in Ftg 1.2. In this organization, a redundant module is constructed using three 
identical slices, and the decoders are majority voters. Failures in any single slice are masked in 
majority voters in successor modules. Operation of die majority voters and processing modules are 
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Fig 1.2. A synchronous triple modular redundancy scheme. 
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synchronized through clock signals. Two implicit assumptions have been made in using the 
synchronous triple modular redundancy scheme to mask single module failures: 

(1) The tune skew in distributing a dock signal to the processing modules is bounded by a known 
constant 

(2) All fault-free processing modules in a triad win process identical inputs and generate the 
corresponding outputs in approximately die same amount of time. 

When Assumption (1) is satisfied, processing modules in each triad will be activated to process a new 
batch of inputs from their input voters within a known time interval. Suppose the processing 
modules are activated to process new inputs by clock pulse C. Then under assumption (2), it is 
possible to calculate the time interval, in terms of die number of clock pulses, after which all 
fault-free processing modules activated at C will have presented their outputs to voters in successor 
modules. These voters are activated after tins time interval to perform their fault-masking function. 
If either assumption is not met, it is possible for a voter to be activated by the clock signal before all 
input signals it receives from fault-free processing modules have stabilized. An eweneous datum may 
then be propagated beyond me voters. Tte synchronization requirement for a synchronous triple 
modular redundancy scheme to operate properly in masking single slice failures is that whenever a 
majority voter is activated to perform majority voting, the input signals it receives from fruit-free 
processing modules must have stabilized to their proper values. This requirement defines the desired 
timing relationship among fault-free modules in a redundant system, and is assured in a synchronous 
system through using a property designed timing i 



In a self-timed system organized by a packet communication architecture, a majority voter in a 
triple modular redundancy scheme is activated upon receiving input packets from processing 
modules. During fault-free operation, a majority voter receiving input packets from processing 
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modutes can be activated after receiving one packet from each processing module. There are 
hardware failures, however, which can cause a processing aiodtili to stop generating outpnt packets. 
We call these failures killing faults. To detect and mask tilling faufts affecting a single processing 
module, a majority voter cannot always wait for an input packet from each processing module. After 
receiving input packets from two modules, it must decide at some point that the other module has 
failed, and start to generate alarm signals and construct the error-free data from the packets it has 
received. When it starts to decode the input signals, however, it must also be sure that packets from 
all fault-free processing modules have already arrived. Otherwise the same type of errors illustrated 
in Fig. 1.2 may occur. 

We have developed an approach to manage redundancy in packet communication systems for 
fault toteanee through enforcing the tbUowtog sync hrwiizati on constraint: 

All input signals from fault-free modules feeding a decoder (the counterpart of 
majority voters in more general byte-sliced configurations) must have stabilized at men- 
proper values whenever the decoder is activated to decode its input signals. 

We refer to this requirement as the timing synchronization problem since it imposes certain timing 
constraints on fault-free byte slices and decoders receiving inputs from them. In our approach timing 
synchronization is maintained by assuring that fault-free byte slices feeding a decoder will always 
send outputs to the decoder within a fixed time interval of each other. This approach will be 
discussed in detail in Chapter 2. 

At the end of Chapter 5, we deserio*briefly« alternative approach to organize redundancy in 
packet communication systems for fault toteraace and analyze fes relative merits and disadvantages. 
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We have also studied another class of fault tolerance problems that arises in non-determinate 
packet communication systems. The notion of aon-detemmucy and the associated consistency 
maintenance problems are illustrated in die following scenario. 

Consider a data base against which transactions from data terminals are processed. Suppose we 
now replicate the data base at several sites to mask single site failures, and it is desired to maintain 
consistency among them by serializing the transactions received at each site and ensuring that the 
streams of transactions fed to the data bases at these sites are an identical. An input transaction 
received at a terminal is forwarded to all the sites. Non-determinacy arises due to the variation in 
transmission delay along paths between terminals and data base sites, so that two transactions sent 
from different terminals may arrive at these sees in different order. To serialize concurrent 
transactions consistently even when some terminals have failed, data base sites must be able to 
communicate with each other. The fault tolerance problem we shall tackle is how to support such 
communication reliably, under the assumption that a failed data base site or failed terminal may 
exhibit such malicious behavior as sending inconsistent information to other sites. 

Our approach to tolerate failures in non-determinate systems is based on a general algorithm 
for reliable message exchange discovered by Pease et al {45). In our research the much shnpler 
single-fault case was developed independently. The consistency problem is explained in Chapter 2, 
and its solution in Chapter 3 and Chapter S. 

L2.3 Implementation Considerations 

To implement fault-tolerant packet communication systems, we need a fault model which 
characterizes (i) die behavior of fatted modules, and (k) the interaction between hardware dements 
employed in the construction of fault handlers and outputs from failed modules. The latter part is 
necessary for evaluating die effectiveness of fault handlers, but is often ignored due to die fact that 
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the stuck-at fault model for logic gates is most commonly used. Under me stuck-at fault model, the 
output toes for faulty modules are assumed to be stuck at either logical or logical 1, and it is 
straightforward to specify the interaction between fiauk-free logic gates and faulty input signals stueJr 
at either or 1. We will study hardware implementation of redundant packet communication 
systems using the stuck-at fault model a random pulse train fault model and a random wave train fault 
model introduced in Chapter 4. The two latter models are chosen because they better capture the 
sensitivity of self- timed hardware modules to runt pulses and output hazards exhibited by their faulty 
neighbors. Their choice is also motivated by failure mechanisms in VLSI technologies which are not 
adequately modeled by classical stuck-at fault models [28}, (63}. For both die random pulse train 
fault model and the random waif traia fault model we have also specifkAthe interaction between 
signals generated by failed modules and h ard w a re elemcnte used to^lemented fault handtes so 
that the effectiveness of our feuh tolerance techniques in combating hardware failures can be 
vigorously evaluated. 

The problem of having metastable states in bistable devices, also known in the literature as the 
glitch problem or die synchronizer problem [14], [IS], [37], [44] poses additional difficulties in 
implementing fault handlers under the random pube train and the random wave train fault model 
This problem often arises at the interface between a synchronous system and an asynchronous input 
When a latch activated by an internally generated timing signal is used to receive an asynchronous 
input signal the input signal may arrive at the latch almost simultaneously with the clock signal The 
latch may men be driven into its metastable state. Once a latch has entered a metastable state, it can 
reside in that state for an arbitrarily long time. When tiie latch subsequeptiy leaves this metastable 
state, its output may undergo a signal transition and violate other timing assumptions in the system. 
Metastable state phenomena have caused failures in synchronous digital systems. In a redundant 
packet communication system, a decoder must receive signals from every byte slice in a redundant 
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module. Under the random pube train and die random wave train fault models, a failed byte slice 
may deliver faulty signals to drive receiver latches in neighboring decoders into their metastabk 
states, causing these decoders to be out-of-sync by an arbitrary amount, with ncsfinite bound. 

In a self-timed system, another bistable device called an arbiter is often used to resolve conflicts 
in resource sharing. A two-input/one-output arbiter forwards input requests it has received at its 
output port. If two input requests are received simultaneously (within a short period of time), 
arbitration is performed and one of die requests will be forwarded. In performing arbitration an 
arbiter can also be driven into a metastable state. 

To tolerate single arbiter Mures, several of mem must be operated together. If two conflicting 
requests are delivered to two faHure-i rKk pe nden t arbiters within a short interval of each other, these 
arbiters may enter mid reside in their respective metastable sates for different periods of time. There 
is thus no guarantee mat they will bom resolve the conflict within a fixed tune period, this poses 
problems in assuring timing synchronization. The arbiter is also a source of nondeterminacy in 
packet communication systems. In the above scenario, there is no guarantee that the two arbiters will 
resolve the conflicts that arise in exactly the same way, and additional mechanisms are necessary to 
maintain consistency among them, so mat the conflicting requests will be forwarded in die same 
order after arbitration has been performed independently by die two arbiters. 

As we shall see in Chapter S, metastabk sate problems, together with massive failure modes 
that affect several modules simultaneously, pose intrinsic limitations on die reliability that can be 
achieved when a self-timed system is enhanced for fault tolerance using die approach developed hi 
this thesis. 
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13 Related Work 

13.1 Fault-Tolerant Architectures 

Redundant hardware for combating failures in electronic computers has been introduced in 
many different forms, ranging from quadded logic for protecting against single logic gate failures [61] 
to modular redundancy schemes for protecting against failures in entire processors, memories and 
buses. Review articles [53], [54] and [55] give accounts of the development of fault-tolerant 
computing in the past two decades. The pervasive use of computer controlled equipment, the high 
cost of maintenance and the lowered hardware costs brought about by VLSI have prompted much 
interest in improving system availability and maintainability through the use of redundant hardware. 
To date fault-tolerant computers systems are found mostly in telephone switching systems, 
space-borne vehicles, air traffic control and o&er applications where computer failures would incur 
high human or economic costs. 

The fault-tolerant data flow processor design presented in mis diesis uses a dynamic 
redundancy scheme to mask hardwire failures: failures ale detected on-line, the hardware system is 
repaired and the program in execution is restarted. Many computer systems have relied upon 
checkpoint/restart as a standard means of recovering from system failures. The STAR (Self-Testing 
and Repair) computer developed at Jet Propulsion Laboratory [8] is probably the first dynamically 
redundant fault-tolerant computer. In STAR each arithmetic unit, memory unit and data bus is 
protected by using error-detecting codes. Units performing logical operations are protected by 
duplication. When a failure is detected, the failed unit is replaced by a spare through power 
switching. Redundancy is managed by a configuration control unit, which is triplicated for fault 
tolerance. To mask hardware failures, checkpoints must be inserted »to appacation progmns from 
which program execution can be restarted after a fctture has occurred A set of operating system 
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procedures is provided specifically for checkpoint insertion. The STAR computer has been 
breadboarded and exercised through fault injection. It has evolved into the FTSC (Fault-Tolerant 
Spaceborne Computer) systems [S2] currently under advanced development at Raytheon and die 
Aerospace Corporation. 

A different approach to fault masking is adopted in the FTMP (Fault-Tolerant Multiprocessor) 
system developed at Draper Laboratory [30] and in the SIFT (Software-Implemented Fault 
Tolerance) system developed at Stanford Research Institute |67]. In both of these systems a program 
is executed on several hardware units and failures are masked by voting on the results of redundant 
computations. In the FTMP system, failures are masked through the formation of processor triads, 
memory triads and bus triads, and voting on the outputs of each triad with majority voters located in 
bus receivers. Spares are provided for replacing failed units. The SIFT system consists of a network 
of processing modules, each with its own local memory. In SIFT majority voting is performed by 
voting programs which receive results from redundant computations over (he network. The number 
of processing modules dedicated to a task can be varied according to its criticality. In both FTMP 
and SIFT fault detection leads to system reconfiguration, but programs need not be rolled back to 
mask hardware failures. 

Our work on hardware redundancy techniques has benefited from work on fault-tolerant 
clocking systems conducted under the FTMP project [161 and fhxn work on consistency 
maintenance conducted under the SIFT project [45]. 

Many fault-tolerant computers in use have a bus-oriented internal architecture and are made 
fault-tolerant through providing redundant buses, redundant processing elements and redundant 
memory systems. The data flow processor contains a packet network which is absent hi these 
architectures, and in general has rather different reUabflhy and p er forman ce requirements. 
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Methodologies for incorporating fault toteraace fatto Nnh pwforaunGe computer systems and the 
associated technical problems have been discussed in |9| fLG|. We have also found these discussions 
illuminating for our work. 

The interconnection networks in a data How processor are modular, one-way, packet routing 
networks. Each network serves a set of source modules and a set of destination modules (which need 
not be distinct). Packets are generated at sonrce modules and delivered by die network to destination 
modules. Nodes in the networks are designed to forward packets only in the direction of packet flow, 
from sources to destinations. Other multiprocessor organisations, C.mmp fS8] Hx example, use 
commutation networks feT m terproces s o r communication. A commutation network is one which can 
be set up to satisfy a set of connectivity requirements between its input nodes and output nodes, but 
is otherwise passive. The fault tolerance properties of reveril chuses of commutation networks have 
been studied in f3Q Dtis work deals only win die to p o tegkt l aspects of fiault-tolenint network 
design. For each class of commutation networks stodtod\ I is shown torn redone 
incorporated so that faulty subnetworks can be bypassed. Several other redundant commutation 
network configurations are described in f42J; » pmt of » ft^-^olenmt iasodati\e pmcxssoT deaga. 
Neither of these works addresses the problems of hardware fault detection and coordinating network 
reconfiguration with recovery. 

Packet routing using Banyan networks is proposed by Tripatiii in f60J. He points out mat a 
faulty node in a packet switched Banyan network can be bypassed if bidirectional paths are provided 
between network nodes, but gives nfr fedundancy tec hn iq u e t or recovery strategies to support tins 
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1 3.1 SynchroBiwtUon and Consistency Mateeaaace 

Our approach to maintaining timing synchronization is to assure that fault-free byte slices in 
the same redundant module will always deliver outputs to decoders receiving signals from them 
within a fixed time interval of each other. The techniques we have developed for this approach arc 
closely related to techniques proposed for implementing fault-tolerant clocks. In a fault-tolerant 
clock several oscillators are operated in parallel to generate a set of periodic signals. These 
independently generated clock signals must, however, satisfy some in-phase requirements. One such 
requirement is that they must all change state wjtitin a fixed tmie interval of each other, just like die 
one we have imposed on byte slices in the same redundant module to maintain tinting 
synchronization. A fault-tolerant clock design has been presented in [16J. In that design me circuitry 
is tailored to generating periodic signals. Wc wiD show how a fault-tolerant clock design can be 
derived from our synchronization techniques. Our synchronization techniques are also related tome 
synchronization voting technique given by Davie* and Wakerly 117J. Davjcs (18). £9) has applied 
synchronization voting to fault-tolerant clock design, syirhreniialion among redundaitt processing 
units, and synchronization among redundant miCTocoraputers. W* wiU also explain how timing 
synchronization can be achieved using synchrojuzation voting under restricted fauk assumptions. 

Consistency maintenance problems in die fault-tolerant space shuttle computer have been 
discussed informally to a paper by Sheridan [561 ititi»s^o|imu|er ! 8jsa^ me same process may run 
on several processor-memory units simultaneously. Whoa an asynchroaous interrupt occurs, the 
process jnay be intejr^tcd at differ Processes in die 

system communicate via shared memory. Suppose that an input interrupt occurs when a process is 
writing into a data area it shares with the input process, and that in one processor-memory unit it has 
finished writing into this area when it is interrupted while in another processor-memory unit it.has 
not The input process may then read different, versions of the shared data in different 
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processor-memory units, and may then exhibit inconsistent behavior on these units. Sheridan 
describes a set of mechanisms for maintaining data consistency in this environment Process 
incarnations running on different computers are required to synchronize with each other before 
accessing shared data. The fault model assumed and the effectiveness of this solution were 
unfortunately not discussed in any detail in [56]. 

In the SIFT system, inputs from different sensors, outcomes of diagnostic tests and internal 
clock readings are exchanged among processing modules on the network. It is assumed that a failed 
module may give inconsistent information to other modules. Pease et al |45] have developed a 
general algorithm for exchanging information among the processing modules so that 

(i) the information transmitted by a fault-free module w&T be known to every fault-free module, and 
(ii) the fault-free modules win agree on the contents of messages transmitted by faulty modules. 

This algorithm requires at least 3f+l modules to tolerate up to f module failures among them. A 
negative result which states that this fault tolerance capability cannot be achieved whh less man 3f+ 1 
modules is also proved in (45). 

For our work we have independently developed the solution to die much simpler case in which 
at most one module may fail and each message contains the outcome of a binary decision. We have 
also studied the hardware implementation of this much simpler case in the self-timed systems 
environment 



L4 Synopsis 

In the following chapters, we first develop an approach to manage redundant hardware in a 
packet communication system for fault tolerance, and men present the design of a fault-tolerant data 
flow processor. Basic concepts and redundancy techniques toe illustrated with hardware modules 
taken from the data flow processor design. Readers who wish to study die architecture of die 
fault-tolerant data flow processor before analyzing redundancy techniques for its implementation can 
also read the material covered in mre thesis in that order. 

Our approach to incorporate redundant hardware into packet communication systems for fault 
tolerance is introduced in Chapter 2. This approach is based on maintaining timing synchronization 
and consistency in redundant packet communication systems, even after, hardware failures have 
occurred. The problems of timing synchronization and consistency maintenance are explained using 
redundant packet communication systems constructed with byte-sliced packet communication 
modules. Concepts and terminology for discussing these problems and their solutions more precisely 
are also presented in mis chapter. 

Robust algorithms for maintaining timing synchronization and consistency among byte slices in 
a redundant packet communication module are presented in Chapter 3. tinder these algorithms, 
byte slices in the same redundant module send signals and messages to each other. A failed slice may 
send signals and messages of arbitrary contents at random. These algorithms are robust in that timing 
synchronization and consistency can be maintained in spite of hardware failures. 

Hardware implementation of these redundancy management algorithms and decoder designs 
are studied in Chapters 4 and S. In Chapter 4 we introduce a class of asynchronous packet 
communication protocols and three fault models. An asynchronous packet communication protocol 
defines the behavior of fault-free packet communication modules. A fault model characterizes the 
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behavior of failed modules as wdl at the interaction between sonata generated by these failed 
modules and fault-free hardware elements used to construct finik handlers. In Chapter 5 we discuss 
how to implement fault handlers, and other hardware modules, to support the fault tolerance strategy 
developed in Chapters 2 and 3. These hardware modules are designed to maintain tuning 
synchronization and consistency, and to detect and/or mask hardware failures, under die fault 
models of Chapter 4. 

A dynamic redundancy scheme for masking hardware failures in a multiprocessor architecture 
designed to execute parallel programs organized by data flow principles is presented in Chapter 6. 
Novel features of this computing system include its packet Mmmurucatwn architecture, use of packet 
networks to support communication among processing elements and dynamic allocation of a 
homogeneous set of specialized functional units to service requests. Program organization and 
hardware module designs to support the dynamic redundancy scheme are explained. Strategies to 
incorporate additional modules and data paths into packet networks to support rapid repair are also 
dcjc.'ibsd. 

Application of the hardware redundancy concepts and techniques developed in Chapters 2 
through 5 are illustrated by module designs given in Chapters 5 and 6. 

A summary of results and coiKludii$ remarks aie given m Chapter 7. 
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1 Timing Synchronization and Consistency Maintenance in Packet Communication 
Systems 

The problems of timing synchronization and consistency maintenance have not received much 
attention in the literature of fault-tolerant computing. These problems arc. explained in detail in this 
chapter. Solutions to these problems, implementation techniques and hardware module designs are 
studied in subsequent chapters. 

The concepts of byte slicing and output encoding have been introduced in Section 1,1, Timing 
synchronization and consistency maintenance problems will be illustrated with a byte-sliced 
organization for redundant packet communication modules. This organization is presented in 
Section 2.1. 

Wc remind the reader that the basic problem in timing synchronization is to assure that 
whenever a decoder is activated to decode its input signals, ail input signals from fault-free byte slices 
must have stabilized at their proper value. Our approach to maintain timing synchronization is to 
develop a set of mechanisms which can maintain proper synchronization. Informally, proper 
synchronization is achieved if bytes in the same input word, and acknowledgments for an input byte, 
arc always delivered by fault-free slices in a redundant module to a neighboring decoder within a 
fixed time interval of each other. In Section 2.2 we introduce seme concepts and terminology for 
characterizing proper synchronization among byte slices hi a redundant packet communication 
module more precisely. 

To use a byte-sliced organization and output encoding for fault tolerance, outputs generated by 
byte slices must also be consistent, in addition to being properly synchronized. Inconsistency among 
bytes in an output word generated by a byte-sliced module is due to non-determinate behavior. A 
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packct communication module is determinate if, starting flom a given initial state, the set of output 
packet sequences it generates is a function of the set of input packet sequences it has received, 
independent of the order of input packet arrivals at its different input ports. A non-determinate 
system may generate different sets of output sequences after receiving the same set of input 
sequences, depending on the order in which packets arrive at its different ports, and on the outcomes 
of internal conflict resolutions. A determinate module has the nice property that its behavior is 
reproducible in both time and space. Two fault-free copjes of a determinate module will generate 
identical output sequences when fed identical input sequences. Their outputs can hence be 
compared to detect failures among them. Transient failures can also be tolerated by feeding die same 
input sequence to a determinate module several times in succession, each time starting from the same 
initial state. 

For now we illustrate how non-determinacy introduces inconsistency by another simple 
example. Suppose in a byte-sliced module we have two copies of a non-determinate module which 
increments an integer input either by one or by two. If these two module copies each receives the 
integer 3, and then one copy outputs 4 and the other 5, we cannot deduce by comparing their outputs 
alone whether a hardware failure has occurred or whether die two module copies have acted 
differently due to non-determinacy. Non-detenwnacy thus introduces additional complications in 
fault tolerance considerations. A primitive for introducing non-determinacy in packet 
communication systems and the associated c oawten c y and fault tolerance problems are discussed in 
Section 2.3. 
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11 By te-slfced Packet Coawunkatton Modules 

A packet communication module interacts with its environment by transmitting and receiving 
packets at its ports. Each port consists of two sets of wires, carrying packets and acknowledgments 
into and out of the port, respectively. A packet communication channel (Fig. 2.1a) is constructed by 
connecting an output port of one module to an input port of another module. 

Packet transmission over a channel can be described in terms of channel state transitions (Fig. 
2.1b). Port activities cause a channel to alternate between an active state and a passive state. In an 
active state a packet is available over the channel from the sender output port In this state the 
receiver input port is enabled. When die available packet is recerved and another packet can be 
transmitted over the channel, the receiver port resets the channel to its passive state. The sender port 
is then enabled and at some later time will change the channel state from passive to active and 
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transmit another packet over the channel. For die discussions in this chapter it is sufficient to note 
that a feedback mechanism is built into a communication protocol, so that during fault-free operation 
packets and acknowledgments alternate at each channel. Techniques for implementing asynchronous 
packet communication using signal transitions on wines in the packet bundle and the 
acknowledgment bundle will be presented in Chapter 4. 

We next describe the hardware structure of a byte-sliced module and <hc structure of 
connections between two such modules. A slice (Fig. 2.2a) consists of a processing module 
(p-modutc), with each input port protected by a synchronizing decoder and each output port 
protected by a fanout module. We will refer to a decoder or a fanout module as a control module, to 
distinguish it from a processing module. Subsequent presentations are greatly simpHftcd by grouping 
decoders and fanout modules in a redundant , module 4»gether into synchronization sets (Fig. 2.2b), 
according to the input port or output port each protects. For such groupings to be meaningful 
p-modules m a by te-shced module must have me same number of input ports and output ports, and 
exhibit very similar packet processing behavior at these ports. The notion of schematic equivalence, 
as defined next, captures the desired relationship. 

Two packet communication modules are schematically' equivalent if 

(1) They have die same number of input ports and output ports, and 

(2) There exists a numbering scheme for input porte and output pom in each module such that if 
identical input sequences are fed to the I™ input pott of both modules, for all i, then the same 
number of output packets will be generated at the j" 1 output port of both modules, for all j. (For 



1. Note that the notion of schematic equivalence, and that of performance compatibility given later 
on, arc well-defined only for fault-free modules. 
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the hardware structures used in this diesis, the numbering scheme is always obvious from context 
and hence omitted.) 

Copies of determinate modules are schematically equivalent land deliver identical output 
sequences when fed identical input sequences. Copies of non-detctminate modules may not be 
schematically equivalent since the number of output packets generated at an output port of a 
non-determinate module may depend on the order of packet arrival at its different input ports, as 
well as on the contents of the packets it has received. Schematically equivalent modules need not be 
identical. It is straightforward to extend the notion of schematic equivalence to a group of more than 
two modules. V 

We will assume that byte slices in a redundant module are constructed out of schematically 
equivalent p-tmdules. A synchronization set (Fig. 2.2b) contains one control module from each slice. 
The common property among control modules in a synchronization set is that each protects the I th 
input port of the p-module in its slice, or each protects the j m output port of the p-module in its slice. 
Synchronization sets arc enclosed in dotted lines in Fig. 2JL 

Connections between adjacent redundant modules are made by connecting a synchronization 
set of fanout modules in one to a synchronization set of decoders in the other (Fig. 13). Each 
synchronizing decoder receives input bytes from a set of fanout modules. Each input word to a 
synchronizing decoder consists of a number of bytes, one from each fanout module it is connected to. 
Note mat each ank shown in Fhj 2.3 consists of* packet bundle carrying signals in the direction of 
the link, and an acknowledgment bundle carrying signals in the opposite direction. Every byte is 
transmitted as a packet between adjacent modules. Asynchronous protocols are also used by a 
p-module to receive packets from its synchronizing decoders and send packets to its fanout modules. 
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Module X 



Module Y 



Fig. 2.3. Connection between synchronization sets in adjacent redundant modules. 

A synchronizing decoder decodes input words received from a synchronization set of fanout 
modules. A fanout module forwards each packet it receives to die synchronization set of decoders it 
feeds. It also receives acknowledgments from these decoders and then returns acknowledgments to 
its p-module. P-modulcs perform packet processing. Let us examine the operation of byte slices in a 
redundant module in more detail 



Suppose each of the fanout modules in a synchronization set receives a byte from its p-module. 
These bytes together constitute an output word of the redundant module, and are distributed by the 
fanout modules to synchronizing decoders in a synchronization set in a redundant successor module. 
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Each input word received by a synchronizing decoder is thus delivered as a batch of packets, one from 
a different fanout module and each containing one byte of the input word. Failure of a single byte 
slice affects at most one byte of any input word received by a decoder. A synchronizing decoder 
decodes each input word received to derive an output word, which it then delivers to the p-module in 
its byte slice. After some time the decoder will receive an acknowledgment from its p-module. This 
acknowledgment is returned immediately fey the decoder to fanout modules from which it has 
received input bytes. Each fanout module will flius receive a batch of acknowledgments, one from 
each byte slice which has received an input byte from it, and failure of a single slice wifl affect only 
one such acknowledgment. After a fanout module in the synchronization set has received 
acknowledgments from the successor module, it returns an acknowledgment to the p-module in its 
slice. 

For subsequent discussions, we will also find it convenient to group packets and 
acknowledgments received and produced by p-modules in die same redundant module into batches. 
P-modules in aredundant module are schematically equivalent and will always be fed identical input 
packets. Packets and acknowledgments delivered to and generated fey fault-free p-modules in a 
redundant module, under these conditions, can be partitioned into batches, according to the ports at 
which they are received or generated, and die positions they occupy in their respective streams. 
Packets, or acknowledgments, received at the corresponding ports of fault-free p-modules in a 
redundant module, and occupying the same position in their respective streams, belong to the same 
batch. Likewise output words, or acknowledgments, generated at the corresponding ports of 
fault-free p-modules in a redundant module, and occupying the same position in their respective 
streams, belong to the same batch. 
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2.2 Timing Synchronization 

For a redundant packet communication system constructed out of byte-sliced packet 
communication modules, timing synchronization imposes two timing constraints on system 
operation: 

(Tl) Bytes jn the same word generated by fault-free slices must have arrived when a 

decoder is activated to decode its input signals, and 
(12) Acknowledgments for the same output byte generated by fault-free slices must have 

arrived when a fanout module is activated to return an acknowledgment to its 

p-module. 

Our approach to maintain timing synchronization is by maintaining proper synchronization: 
A redundant packet communication system is properly synchronized if 

(PI) Bytes in the same input word are always delivered by fauk-free slices In a 

neighboring module to a synchronizing decoder within a fixed time interval of 

each other, and 
(P2) Acknowledgments for die same output byte are always delivered by fault-free slices 

in a neighboring module to a fanout module within a fixed tine interval of each 

odier. 

If proper synchronization is maintained, and there are enough fault-free byte slices remaining 
in each redundant module, then each decoder can determine whether it has received all bytes in the 
same input word generated by fault-free slices in a neighboring module. And after delivering a byte 
to synchronizing decoders in a neighboring module, each fanout module can determine whether all 
acknowledgments for that byte, generated by fault-free- slices in the neighboring module, have been 
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received. The technique used in these control modules to achieve (Tl) and (T2>, given <P1) and (P2) 
will be presented in the next chapter. 

Timing synchronization, as illustrated in Fqj. 1.2, is achieved in synchronous systems by using 
byte slices which can process identical inputs in approximately the same amount of time, and by 
assuring that each clock signal is delivered to all byte slices in a redundant module within a fixed time 
interval. Our strategy for maintaining proper synchronization is also based on these ideas. 

For a more precise discussion on timing synchronization, wc next define performance 
compatibility for schematically equivalent modules, and in-phase operation for p-modules in 
redundant packet communication systems. The definition of performance compatibility is intended 
to capture the intuitive notion that two schematically equivalent modules are able to process identical 
inputs in roughly the same amount of time. In-phase operation is die approach we have adopted to 
ensure proper synchronization. 

Performance Compatibility 

Two p-modules M and M' are performance eomp«aWk)i 

(1) They are schematically equivalent, and 

(2) If identical words, and acknowledgments, are always delivered to M and Kf as a batch within a 
time interval bounded by p* tbeapackets and acfeiowtedgmeaitin the same output batch will be 
generated by M and M' within a time interval oonndedby T*p* fbrsome fixed r. 



1. We use lower case Greek letters to denote time intervals. 
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It is straightforward to generalize the definition of performance compatibility to a group of 
more than two schematically equivalent p-modules and to a group of control modules. We shall 
assume performance compatibility for determinate p-modules in a redundant module, and for 
control modules in a synchronization set We will often say that two modules arc performance 
compatible within r, and call t an upper bound on performance incompatibility. 

Performance compatibility does not impose any constraint on how long a module may take to 
process an input, but only specifies the timing relationship among outputs delivered by schematically 
equivalent modules. We also note that two copies of a non-determinate module containing arbiters 
are not performance compatible since if either one of them enters a metastabk state, it may reside is 
that state for an arbitrarily long time (Section 123). 

In-Phase Operetta 

A packet communication system maintains in-phase operation of the byte slices in a redundant 
module if input words and acknowledgments in the same batch are always delivered to p-modules in 
these byte slices within a fixed time interval ©# *ach other. 

For in-phase operation, we require that inputs in every batch delivered to p-modules in a 
redundant module be delivered within me some feted time interval. We wiH also say mat the outputs 
of a group of p-modules in a redundant module are in-phase -V- outputs in the same batch are always 
generated by the p-modules within a fixed time interval. 

Our approach to maintaining proper synchronization is to maintain in-phase operation of 
p-modules in fault-free byte slices in every redundant module. A redundant packet communication 
system in which: 
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(i) byte slices in every redundant module arc constructed using performance compatible p-modules, 
(ii) in-phasc operation of these byte slkes k m&Httatned, 

will remain properly synchronized. Wc reason as follows. If in-phasc operation of performance 
compatible byte slices in a redundant module is always maintained, men outputs of p-modules in 
fault-free slices must be in-phase. These outputs are forwarded through control modules in 
synchronization sets to neighboring modules. Since control modules in die same synchronization set 
are also performance compatible, bytes in the same word and acknowledgments for the same byte 
will be generated by these control modules, and hence by die fault-free slices, within a fixed time 
interval of each other. We make the further assumption that communication paths between 
synchronization sets have finite propagation delay. Proper synchronization (Properties PI and P2) 
can then be maintained. 

The synchronization techniques presented in Chapter 3 enable control modules in 
synchronization sets in a redundant module to maintain in-phase operation of byte slices in die 
redundant module locally. The strategy is to let control modules in the same synchronization set 
synchronize with each other before passing die inputs they have received on to the p-modules they 
are guarding. Using tie synchronization techniques developed in Chapter 3, all fault-free control 
modules in the same synchronization set will deliver words or acknowledgments in die sane batcn to 
the p-modules within a fixed dme interval, despite interference from failed modules. 

23 Consistency MaintefiaM* 

Packet communicadon modules send streams of data packets to each other during system 
operation. A module is determinate if the set of output packet streams it generates is a function of die 
set of input packet streams it has received, independent of the temporal order in which input packets 
arrive at different input ports. A module is non-determinate if its output streams may depend on this 



arrival ureter and on die outcomes of interna) conflict resolutions, as well as on die data values carried 
in the input streams it has received. Non-detcrmioacy introduces another dimension to the fault 
tolerance problem, namely, information replicated in redundant hardware must be kept consistent 
The consistency problem is widely studied in distributed data base systems. Sheridan [56] discusses 
the problems of maintaining data consistency during instruction execution on the Space Shuttle 
control computer, with a program structure supporting prioritized multiprocessing and 
intcrrupt-driven input/output Pease et al [45] have given algorithms for achieving interactive 
consistency in exchanging clock readings, sensor readings and diagnosis results among multiple 
computers. They have formulated the consistency problem as the "Albanian Generals Problem" : 

Several divisions of the Albanian Army are camped outside aw enemy city, each division 
commanded by its own general. The generals can communicate with one another only by 
messengers. After observing the enemy, they must decide upon a common plan of action, either to 
attack or retreat However, some of the generals may be traitors trying to prevent the loyal generals 
from reaching agreement The generals must have an algorithm to guarantee that 

(1) All loyal generate decide upon the same plan of action, and 

(2) The loyal generals cannot be persuaded to follow the plan of action contrived of by a small 
number of traitors. 

We have noted that the need for consistency maintenance is a consequence of 
non-detcrminacy. In die Albanian Generals problem, each loyal genera^has to independently decide 
whether to attack or retreat Loyal generals can thus make conflicting decisions, leading to 
non-detcrminacy. 

In this diesis we study consistency and fault tolerance for one specific form of non-0>tcrminacy 
in packet communication systems, introduced by combining two streams into an input stream to a 
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determinate module, using a merge module (Fig. 2.4). If two packets arrive at the two input ports of a 
merge module within a short time interval, the merge module may output them in either order, and 
may furthermore take an arbitrarily long time to decide which packet to output first When input 
packets arrive further apart in time, the merge module outputs them as they are received. 

In a redundant packet communication system, merging two input streams is implemented using 
a redundant module containing merge modules as p-modules in its byte slices. In this setting, 
identical input packets are always delivered to merge modules in the same redundant module within 
a fixed time interval, as a result of maintaining, in -phase operation in the system. The consistency 
problem is to ensure that the output streams of non-faulty merge modules in a redundant module are 
identical. The fault tolerance problem is to ensure consistency and 4n-phase operation in the 
presence of hardware failures. The merge modules used in a redundant module also exchange 
messages with each other to achieve consistency. We shall call them synchronizing merge modules in 
the sequel. 




Fig. 2.4. Stnictureofanon-determinate module. 
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The strategy to achieve fault tolerance is to allow synchronizing merge modules to exchange 
messages with each other, informing each other of toe input port iron) which a new packet has been 
received, and men reach an agreement with other synchronizing merge modules on which input port 
to use for forwarding the next packet Note that it is not sufficient to receive a message from each 
synchronizing merge module in the redundant module and then perform majority voting on their 
contents to pick an input port. This is because die fault-free synchronizing merge modules can be 
split up just about evenly on which of two input ports to use. and faulty synchronizing modules can 
send different messages to different Fault-free synchronizing modules, leading them to different 
majority vote decisions. We present an implementation of this strategy based on an algorithm for 
exchanging messages reliably among synchronizing merge modules in Chapter 3. This algorithm has 
been described by Pease et al f45] while we have developed the much simpler case for dealing with 
single synchronizing merge module failures independently. 

14 Discussion 

We have introduced two redundancy management problems for self-timed packet 
communication systems in this chapter: timing synchronization and consistency maintenance. These 
are redundancy management problems in die sense that these problems arise only when hardware 
functions are implemented using submodules which mutt be failure-independent, and hence operate 
independently, and yet their outputs must be correlated in some well-defined manner, such as by an 
encoding scheme. If hardware is immune from failures, hardware functions can be implemented in 
packet communication modules using techniques described in {4], [31], [33], [39] for which such 
problems do not arise. For applications requiring fault tolerance, we need solutions to these 
problems which arc robust, so that timing synchronization and consistency are maintained in spite of 
hardware failures. By maintaining proper synchronization and consistency in a redundant packet 
communication system, a concrete foundation is provided for using byte slicing and encoding 
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techniques to detect and/or mask byte dice failures in self-timed hardware systems. 

The problems of timing synchronization and consistency maintenance are illustrated with the 
byte-sliced module organization presented in Section 2.1 for redundant modules. Timing 
synchronization will be a problem in any fault-tolerant computer system constructed out of 
independently clocked processor-memory units. Consistency maintenance arises in sections of 
fault-tolerant computer systems that receive asynchronous inputs. These problems have not received 
much attention in the literature, perhaps due to their infrequent occurrence. It is nonetheless 
important to understand the nature of these problems, so that fault tolerance mechanisms for 
handling them are omitted only through judicious choice, and not from ignorance. 

Timing synchronization is also a fundamental problem facing the designers of fault-tolerant 
clocks. In a fault-tolerant clock several oscillators are operated in parallel to generate a set of periodic 
signals. These independently generated clock signals must, however, satisfy some in-phase 
requirements. Synchronization can only be achieved by exchanging signals among oscillators, and 
must be maintained when a faulty oscillator stops oscillating or starts making signal transitions at 
random. The timing synchronization algorithm presented in Chapter 3 may be viewed as a 
generalization of the techniques described in [16] for implementing fault-tolerant clocks. 

In a synchronous system intermodule activities are synchronized by referencing a common 
timing signal. After receiving an input each module is required to generate the corresponding output 
within a fixed time interval defined with respect to the timing signal. This implies performance 
compatibility among byte slices in a redundant module since they all must satisfy the same 
input/output timing constraint In-phase operation is ensured as a consequence of the implicit 
assumption of negligible variations in propagation delay in distributing die timing signal from a 
common source to its numerous destinations. Fault handlers, by synchronizing with other modules 
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through this same timing signal, need only deal with data faults and code violations. 

The input/output specification of a self-timed module imposes only sequencing and causality 
constraints on an implementation. For fault tolerance we have to reintroduce the notion of 
performance compatibility and to incorporate mechanisms to maintain in-phase operation in 
self-timed systems. Time metrics are thus reintroduced into an otherwise completely self-timed 
system. The additional complexity needed to deal with inconsistency is also a consequence of 
adopting the self-timed discipline. This is because arbitration is not needed in a synchronous system: 
conflicting requests can always be resolved through a predefined priority assignment 

Our whole approach to redundancy management for fault tolerance is based on maintaining 
timing synchronization and consistency in redundant packet communication systems. A 
methodological issue that arises is whether these two properties arc necessary for tolerating hardware 
failures in packet communication systems. Its seems mat consistency must always be maintained in a 
byte-sliced module, otherwise a packet batch generated by fault-free slices can have arbitrary 
contents, and byte slice failures can lead to undetected and uncorrected errors. The necessity for 
timing synchronization is further discussed at the end of Chapter 5, where we examine an alternative 
strategy and its relative merits and disadvantages. 
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3. Robust Algorithms for Timing Synchronization and Consistency Maintenance 

In this chapter we introduce the basic algorithms used in our approach for maintaining timing 
synchronization and consistency in redundant packet communication systems. We maintain timing 
synchronization by maintaining in-phase operation of byte slices in a redundant system (Section 12). 
We maintain consistency by assuring that fault-free synchronizing merge modules in the same 
redundant module generate identical output streams (Section 2.3). 

The algorithm for timing synchronization is implemented in every synchronization set for 
maintaining in-phase operation. Under this algorithm, control modules in the same synchronization 
set exchange signals with each other after receiving bytes in a new input word, or acknowledgments 
for a previously delivered byte, so that fault-free control modules in that set can deliver packets and 
acknowledgments in the same batch to p-modules within a fixed time interval of each other. The 
algorithm for consistency maintenance allows synchronization merge modules to exchange messages 
with each other to jointly determine the input port from which die next output packet should be 
taken. These algorithms must be robust in that in-phase operation and consistency must still be 
maintained in spite of malicious interference from failed modules. 

For presenting these algorithms in this chapter, we use an abstract fault model under which a 
failed slice may generate signals and messages of arbitrary contents at random, or stop generating 
them altogether. Implementation of these algorithms, and circuit design techniques to assure that 
signals processed under these algorithms in control modules correspond to the abstract fault model 
used in mis chapter, are studied in Chapter 5. The algorithms for maintaining timing synchronization 
and consistency are presented in Section 3.1 and 3.2, respectively. 
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3.1 An Algorithm for Timing SyiKtaNiizaliM 

Let us refine our design methodology given in Section 1.1 foi constructing redundant packet 
communication systems. First of all, the non-redundant system is designed and specifications for 
each module in the non-redundant system are determined. Then an output encoding scheme is 
picked. To construct the redundant system, specifications for the p-nwduks in each redundant 
module arc derived from the functional speciflcatioBt#th&iior^fedjindammodute and the chosen 
encoding scheme. P-modulcs, other than synchronizing merge modules, can then be implemented 
using techniques given m [4J, [31], [33J, [39). Each p-module in a redundant module is extended into 
a byte slice by protecting its input ports and output ports with synchronizing decoders and fanout 
modules. These slices are then connected together by adding communication paths between control 
modules hi the same synchronization set to form a redundant module. 

In our approach to constructing redundant packet communication systems, we design each 
redundant module so that for each module we can specify: 

(1) its functional specification, 

(2) for each synchronization set in that module , an upper bound on the phase difference among 
bytes in the same output word, or acknowledgments for the same byte, generated by mat 
synchronization set This upper bound is by design an intrinsic property of die module, 
independent of any property of its neighbors. 

We also design each control module so that timing synchronization (Properties Tl and T2 in Section 
12) can be achieved once an upper bound for phase difference among packets and acknowledgments 
in the same input batch is known. 



-55 



In constructing a redundant system, two redundant modules are connected together by 
connecting a synchronization set A in one to a synchronization set B in the other, as shown in Fig. 2.3. 
Given the specified upper bound on the phase difference among bytes in the same output word, or 
acknowledgments for the same input byte, generated by synchronization set A, and the variations in 
propagation delay along paths connecting A to B, we can calculate an upper bound on the phase 
difference among inputs in the same batch delivered to B. This derived upper bound is the same for 
all input batches sent from A to B. Likewise an upper bound is determined for batches of packets or 
acknowledgments sent from B to A. These upper bounds on phase differences are then used to 
"adjust" each control module to assure timing synchronization. 

After a redundant system so constructed goes into operation, proper synchronization is 
maintained since the phase difference among output batches generated by a redundant module is 
always bounded by a constant which can be derived from parameters in the redundant module alone, 
independent of its interaction with its neighbors. As long as proper synchronization is maintained, 
timing synchronization is also maintained. 

To support our design methodology, we need techniques for synchronizing byte slices so that 
the upper bounds specified in (2) above indeed exist, as well as techniques for designing control 
modules which can achieve timing synchronization given these upper bounds. In this section we 
develop an approach that can satisfy both of these requirements. In this approach, the basic cycle of 
activities carried out in a control module, regarding synchronization, consists of several steps: 

For a synchronizing decoder, 

(1) receive an input byte from each fault-free byte slice it is connected to, timing out faulty slices, 

(2) synchronize with other decoders in its synchronization set, 

(3) forward a word, deduced from the input bytes it has received, to the p-module it guards, 

(4) receive an acknowledgment from .the p-module and return the acknowledgment to all byte slices 
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it receives bytes from. 

For a fanout module, 

(1) receive a byte from the p-module in its slice, and forward the byte to all decoders in me 
synchronization set k feeds, 

(2) receive an acknowledgment from each fault-free byte slice it is connected to, timing out faulty 
shees, 

(3) synchronize with other fanout modules in its synchronization set, 

(4) return the acknowledgment to the p-module. 

The synchronization algorithm is implemented in synchronizers m control modules. A 
synchronizing decoder (Fig. 3.1a) consists of a synchronizer section and a decoder section. The 
synchronizer section implements me synchronization algorithm. The decoder section decodes the 
input word. The synchronizer section consists of a number of byte detectors (BDs) and a 
synchronizer. The byte detectors are designed to receive packets in me adopted communication 
protocol. Asynchronous packet communication protocols are discussed in Chapter 4.1. For this 
chapter it is sufficient to note that arrival of each new byte at a byte detector causes a signal to be sent 
on its output line to the synchronizer. Arrival of a new input word is signaled to the synchronizer as a 
batch of signals, one from each byte detector. By using byte detectors, the synchronizer can be 
designed independently of the packet protocols adopted. The synchronizer generates an output 
signal for each new batch of input signals it receives. In a synchronizing decoder, this signal activates 
the decoding section. Outputs of the decoder section are subsequently delivered to the p-module in 
its byte slice. Each acknowledgment received by die synchronizing decoder is returned to the fanout 
modules from which it receives input bytes. 
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Fig. 3.1. Hardware structure of control modules. 



Operation of a fanout module (Fig. 3.1b) is similar to that of a synchronizing decoder, except 
that no input decoding is necessary, and acknowledgment signals returned by a neighboring module 
are received by its synchronizer directly. In a fanout module, the output signal of the synchronizer is 
returned as an acknowledgment signal directly to the p-module in its slice. 
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To support the requirements of our design methodology, k is sufficient for die synchronizers to 
guarantee that: 

(51) For each new batch of input signals received, all fault-free synchronizers in the same 
synchronization set will generate output signals within a fixed time interval of each other. The 
duration of this interval is bounded by a constant which can be calculated from the time it takes 
a synchronizer to perform basic operations, and each basic operation can be performed in a fixed 
time. 

(52) If input signals in the same batch arc generated by fault-free neighboring slices within a fixed 
known time interval of each other, men for each new input batch, a synchronizer will generate its 
output signal only after it has received all signals in mat batch generated by fault-free slices. 

In this section we describe a synchronization algorithm for developing a synchronizer 
implementation with these properties. For itas description, we assume oat: 



(Al) Each redundant module is constructed out of 3f + 1 byte slices, up to f of which may fail. 
(A2) Input signals in the same batch, generated by fault-free neighboring slices, are delivered to a 
synchronizer within S of each other. 

We will also use an abstract fault model which states mat a faulty byte slice can send signals, 
cimer directly or indirectly through byte detectors, to a synchronizer at random. In particular, a 
failed module may stop sending signals altogether. 

Since each control module communicates with its neighbors via a handshake protocol, the 
interaction between a synchronizer and its neighboring fault-free slices obeys the constraint that none 
of these neighbors will deliver another signal to a synchronizer until after the synchronizer has 
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generated an output signal in response to the previous batch of signals it has received from them. 
Operation of a (fault-free) synchronizer under the synchronization algorithm, after it has just 
delivered an output signal in response to the previous batch of input signals, is as follows: 

For each batch of input signals received, a synchronizer sends exactly one signal to every 
synchronizer, including itself, and generates exactly one output signal. A synchronizer sends signals 
to other synchronizers and to itself only after it has determined that an input signal from a fault-free 
neighboring slice has been received by some synchronizer. Due to die assumption that up to f 
neighboring byte slices and f synchronizers may fail together, a synchronizer cannot be certain of this 
until after it has received input signals from more than f neighboring slices, or synchronization signals 
from more than f synchronizers in its synchronization set. As soon as a synchronizer has received 
input signals from f+ 1 distinct neighboring slices or synchronization signals from f-t- 1 distinct 
synchronizers, it starts sending synchronization signals to itself and to other synchronizers. Once a 
synchronizer has received signals from 2f+ 1 synchronizers (itself included), it waits 5 seconds and 
then delivers an output signal. 

D 

Suppose a fault-free synchronizer receives synchronization signals from f+1 distinct 
synchronizers at t and as a result generates synchronization signals to other synchronizers, we assume 
that all synchronizers will receive these signals by t -I- p, where p is an upper bound on propagation 
delay through the corresponding data paths. 
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Lemma 3.1 Under assumption (Al) and (A2): 

(i) the output signals of the fault-free synchronizers are in-phase wkhin a fixed time interval, 

namely, 2p, and 
(ii) No non-faulty synchronizer will deliver its output signal before receiving a signal from 

every non-faulty neighboring slice. 

Pnxf: 

(i) Suppose a non-faulty synchronizer receives 2f + 1 synchronization signals at time tj and delivers 
its output signal at tj+8. At tj k must have received signals from at least f+1 non-faulty 
synchronizers. At time tj+p every non-faulty synchronizer will have received at least f+1 
signals from these non-faulty synchronizers, and wiO respond by sending signals to other 
synchronizers if it has not already done so. At tj+2p each non- faulty synchronizer will have 
received at least 2f+l signals from other synchronizers and at tj+2p+S each will have 
delivered its output signaL 

(ii) When a non-faulty synchronizer has received synchronization signals from f+1 distinct 
synchronizers, at least one non-faulty synchronizer wffl have received input signals from f+1 
neighboring slices, and at least one of these slices must be non-faulty. Thus under die 
assumption (hat signals in die same batch are delivered by fault-free neighboring slices within 8, 
when a non-faulty synchronizer delivers its output signal 8 time units after it has received 
synchronization signals from 2f+l distinct synchronizers, aH input signals from non-faulty 
neighboring slices must have arrived. 

D 

Note mat Lemma 3.1 holds as a result of die actions of non-faulty synchronizers. The 
algorithm guarantees in-phase operation regardless of the faulty behavior exhibited by failed 
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synchronizers. Even though up to f faulty neighboring slices and f faulty synchronizers can send 
signals to synchronizers at random, they cannot trigger synchronization and output activities in 
fault-free synchronizers. And even if they stop sending signals, they cannot delay such activities 
indefinitely. To support in-phase operation in a packet communication system with cyclic paths, 
output signals from non-faulty synchronizers must be in-phasc within a fixed bounded interval after 
synchronization. For the above strategy this phase difference is bounded by 2p. To satisfy (SI), 
however, a synchronizer must be implemented in such a way that the operation of generating 
synchronization signals after receiving such signals from f + 1 distinct synchronizers takes a fixed 
amount of time, independent of the phase difference among inputs to the synchronizer. 
Implementation issues arc discussed in detail in Chapter S. 

In the proof of Lemma 3.1 we have assumed that synchronizers act instantaneously. If 
synchronizers do not act instantaneously, variations in logic and path delays through them is bounded 
by some t. Phase difference among their outputs may then increase by t, again a time parameter 
which depends only on hardware parameters internal to the synchronizer set The phase difference 
among synchronizer output signals in the same batch can still be bounded by a parameter applicable 
to all batches, and cannot grow without bound. 

We can also develop some insight on why our timing synchronization algorithm requires 3f+l 
synchronizers to tolerate up to f failures among them. Assume we have a total of S synchronizers. A 
synchronizer must generate an output signal upon receiving signals from (S-f) synchronizers 
(including itself). This is because up to f synchronizers may fail and stop sending signals altogether. 
When a synchronizer generates an output signal, the other synchronizers should start sending signals 
to each other. We observe that if a synchronizer has received signals from (S- synchronizers, f of 
these synchronizers may be faulty and hence may not send signals to other synchronizers. Thus when 
a synchronizer generates an output signal, other fault-free synchronizers may have received signals 
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from only ((S-f) - f) synchronizers. Oar strategy thuscaHs for a synchronizer to send sigiials to 
other synchronizers upon receiving signals from ( (S-f) - f) synchronizers. A faulty synchronizer 
may send signals to odier synchronizers at random. To prevent f faulty sync h ron ize rs conspiring 
together from controlling the behavior of fault-free sy n ch roni ser s, we must have( (S-f) - f ) > f, 
Le.,S>3f. 

A similar analysis will show that it is sufficient for a synchronizer to receive input signals from 
2f+ 1 neighboring slices to tolerate up to f failures among these slices under the abstract fault model 
We have chosen to assume that a synchronizer receives input signals from 3f+l neighboring slices to 
simplify me presentation. The construction of redundant modules using less man 3f+l slices is 
further discussed in Section 3 3. 

3.2 An Algorithm for Consistency Maintenance 

A scenario for studying consistency maintenance problems in redundant packet communication 
systems was given in Section 12. In this scenario non-determinacy is m&oduced using ^mrorang 
merge modules as p-modules in byte slices in a redundant module. In such a non-determinate 
module all control modules in the same synchronization set will use the timing synchronization 
algorithm presented in Section 3.1 to maintain in-phase operation of die synchronizing merge 
modules. Input words and acknowledgments in the same batch wfll thus be delivered to 
synchronizing merge modules in die same redundant module within a fixed time interval. 

When every synchronizing merge module has a packet pending for output from one of its input 
ports, which we shall call die module's request source, they exchange messages with each other and 
jointly pick an input source. Each synchronizing merge module then forwards a packet from the 
input source to its fanout module. The synchronizing merge modules can choose any input port 
proposed by a fault-free synchronizing merge module to be the next Input source, since packets wfll 



63 



be delivered to every synchronizing merge module at that input port within a fixed time interval. 
During the message exchange, faulty synchronizing merge modules may lie. Non-faulty 
synchronizing merge modules must be able to pick the same input source, in spite of liars. As long as 
fault-free synchronizing merge modules agree on their input sources, their output streams will be 
identical. 

Note that this problem is different from the timing synchronization problem, in that modules 
participating in the synchronization activities must reach agreement on the outcomes of two-way 
decisions, rather than on when events have occurred. Our approach to solving this fault tolerance 
problem is based on an algorithm for exchanging messages among synchronizing merge modules, 
under which: 

(1) For every synchronizing merge module M, faulty or not, all fault-free synchronizing merge 
modules will agree on M's request source. 

(2) The request source proposed by a fault-free synchronizing merge module will be known to all 
other fault-free synchronizing merge modules and used in their decisions to pick an input 
source. 

(1) assures that if each fault-free synchronizing merge module applies the same algorithm to 
pick an input source, based on the set of request sources it has deduced for other synchronizing 
merge modules, then all fault-free synchronizing merge modules will pick the same input source. (2) 
assures that the chosen input source indeed has packets pending, so that faulty synchronizing merge 
modules cannot cause all fault-free merge modules to pick one input port as their next input source 
while these fault-free merge modules have only received packets at the other input port 

An algorithm for exchanging request sources among a set of synchronizing merge modules to 
achieve (1) and (2), using 3f + 1 synchronizing merge modules to tolerate up to f failures among them, 
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is given by Pease ct al in [45]. In this algorithm messages are exchanged in rounds, and die number of 
rounds of exchange grows linearly as the number of failures to be tolerated. This contrasts with our 
timing synchronization algorithm Hi which only one round of signal exchange is ever necessary. 
Pease ct al have also proved a negative result which says that fewer than 3f+l synchronizing merge 
modules arc not sufficient for tolerating f failures among them, if (1) and (2) are to be achieved by 
exchanging unauthentkatcd messages alone. 

For this diesis we wilt only discuss the single synchronizing merge module failure (f = 1) case 
in detail. This grcady simplifies we presentation, especially the implementation details, it is also the 
most important case for die maintained etrvironment envisioned for our physics simulation 
applications, in which the mean time to repair is relatively short compared to the mean time to failure 
for any redundant module. 

We will thus assume that there are four synchronizing merge modules, one of which may be 
faulty. We start at the point at which each synchronizing merge module has received a packet from 
an input port, called its request source, and present die algorithm the synchronizing merge modules 
use to jointly pick the input port, called die input source, for transferring die next packet to die 
successor module. We assume that a failed synchronizing merge module may send messages to other 
synchronizing merge modules at random, and that the messages they send can specify either input 
port as its request source. The algorithm also calk for tuning out other synchronizing merge modules 
to avoid waiting for messages from faulty modules indefinitely. To set up these time out 
mechanisms, we need to know die phase difference among inputs in die same batch to the 
synchronizing merge modules, and take metastable state phenomena into account The details of 
implementing these mechanisms will be discussed in Chapter S. 
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Each synchronizing merge module executes the following steps to determine the new input 
source. Request sources and input sources are denoted by either @ or 1. 

(i) Broadcast its request source to all synchronizing merge modules, including itself. 

(ii) Receive messages identifying meir request sources from other synchronizing merge modules. 
This step is timed-out and either fi or 1 is assumed for a late module. 

(iii) Relay messages received in Step (ii) to other synchronizing merge modules, including itself. 
These messages have the format: 

"I am module i (1 < i < 4). Synchronizing merge module j (1 ^ j < 4) told me 
that k has a request from input port k(Q or D". 

(iv) Receive messages sent by synchronizing merge modules at Step (iii). This step is also timed-out 
and arbitrary values are again assumed for request sources in missing messages. Each 
synchronizing merge module i will have received four messages regarding the request source of 
module j, one from each synchronizing merge module (including itself). The contents of these 
four messages are entered into a row of a message table (Sample message tables are given in Fig. 
3.2). Each element in this table is the request source which die synchronizing merge module 
named in its column heading claims to haye received from the synchronizing merge module 
named in its row heading. Module i picks the request source for module j by ignoring the (j, 
j)-th entry in the table and voting on die remaining entries in the row named by j. 

(v) Having determined four request sources, one for each synchronizing merge module, the one in 
majority is taken as the input source for the next output packet In case of a 2-2 tie, pick Q over 
1. 
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This algorithm is illustrated by the example given in Fig. 3.2. Instead of numbering die 
synchronizing merge modules, we label diem A, B, C and 0. The soW-laulty synchronizing merge 
modules A, B and C receive messages from request sources 1, 1 and Q, respectively. Messages sent by 
D can have arbitrary values. The last column in the table kept by a module gives the request source 
determined by dial module in Step (iv) for each synchconizing Merge module The extra box in die 
lower right hand corner of each table gives die input source deduced to Step (v) from the request 
sources. Note that die contents of messages sent by D in Step (iii) does not affect any decision made 
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by A, B and C in reaching agreement on an input source. The reader should also check that 
properties (1) and (2) stated at the beginning of this section arc indeed satisfied 

The same tables shown in Fig. 3.2 will have been constructed if either one of the three modules 
A, B, or C times out module D in Step (ii) and assumes the corresponding value for D's request 
source. In this case. A, B and C would have picked the same input source in exactly the same way. 

We next verify the algorithm informally. If a synchronizing merge module M is fault-free, all 
fault-free synchronizing merge modules (at least three, including M) will broadcast M's request 
source to each other, and hence will agree on M's request source among themselves. If M has failed, 
the information exchanged among the three other synchronizing merge modules regarding M's 
request source will be consistent And again all non-faulty synchronizing merge modules will agree 
on M's request source. The reader is referred to (451 for the general algorithm and for a correctness 
proof. 

33 Discussion 

In the absence of a global timing reference, the timing synchronization algorithm presented in 
Section 3.1 allows control modules to maintain in-phase operation locally. This algorithm is closely 
related to the fault-tolerant clock design described in fMJ, and may be viewed as a generalization of 
die synchronization voting technique studied by Davies and Wakerly in [17]. These two references 
have been sources of ideas and examples for our work. 

Fault-tolerant Clocking Systems 

In the fault-tolerant clocking system described in [16] an array of identical oscillator modules is 
used to produce a number of phase-locked clock signals for global synchronization. Using 3f+l 
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modules, phase-locking is maintained on at least 2f+l of tfie global dock signals after f failures. 
Local conditioning circuits at hardware units convert the 3f+ 1 global dock signals into local clock 
signals that are phase-locked if no more than f global clock signals have failed. Phase-locking among 
the global clock signals is maintained by conditioning each oscillator output to change state after 
either the elapse of a clock period or after sufficiently many other global clock signals have changed 

Our synchronization algorithm presented in Section 3.1 can be used to implement a 
fault-tolerant clock in the configuration depicted in Fig. 3.3. In this scheme each synchronizer has a 
single input port and sends signals to other synchronizers as soon as it receives an input signal from 
this port. The output signal generated by the synchronizer is fed back to itself. The output signals 
generated by fault-free synchronizers can be out of phase by as much as 2p (Section 3.1). Each 
synchronizer, after receiving synchronization signals from 2f+ 1 distinct synchronizers (including 
itself), waits 2p before sending the next output signal 
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Fig. 3.3. A fault-tolerant docking system. 
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Suppose each signal transition (either O-+lorl-*0)onits output line represents an output 
signal generated by the synchronizer, then the configuration illustrated in Fig. 3.3 has the following 
properties: 

As long as less than f of the synchronizers have failed, 

(1) AH fault-free synchronizers will generate output transitions in the same direction within 2p of 

each other. 

(2) A fault-free synchronizer will make an output transition in one direction after all fault-free 
synchronizers have made their previous output transitions in die opposite direction. 

These are exacdy the phase requirements to be satisfied by redundant periodic signals generated by a 
fault-tolerant clocking system. Our synchronization algorithm thus suggests an approach to construct 
fault-tolerant clocks. Its practicality depends on whether of hot if is possible to implement the 
desired properties of me synchronizer in a fault-tolerant fashion in the given hardware technology. 

D 

Synchronization Voting 

Under the abstract fault model, a failed control module may send agnate to owy some 
synchronizers in its neighboring synchronization set, but not to others. Par a restricted class of 
hardware failures, however, a failed control module will eitiieraat aend any signah\ or send one to 
each synchronizer within a fixed time interval. For such failures, timing synchronization can be 
achieved using die synchronization voting technique of Da vies and Wakeriy (17). 

In synchronization voting, synchronizers in die same synchronization set are designed assuming 
that input signals in die same batch are in-phase within some 6. No communication is necessary 
among synchronizers. After receiving input signals from f+1 distinct byte slices, a synchronizer 
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waits 8 seconds, and then generates an output signal. 

For the restricted class of hardware failures under consideration, the phase difference S among 
input signals in the same batch received by a synchronizer from fault-free neighboring slices depends 
only on: 

(1 ) variations in propagation delay along paths connecting adjacent synchronization sets (Fig. 2.3), 

(2) variations in propagation and gate delay among synchronizers in the same synchronization set, 

(3) performance incompatibility parameters among p-modutes mfte same synchronization set 

S can thus be computed for each synchronizer after the redundant system is designed and the above 
quantities are known, and used m Ms hardware implementation. 

If a failed module can send signals to some synchronizers in its neighboring synchronization 
set, but not to others, then synchronization voting is not sufficient for timing synchronization. 
Suppose a set of synchronizers designed under synchronization voting receives a batch of signals 
from fault-free slices that are in-phase within 6\ Under the more general fault assumption, one 
synchronizer may receive f + 1 signals 8 seconds before another. Output signals generated by these 
synchronizers can then be out of phase by S. If the p-moduks are performance compatible within t, 
the next batch of outputs generated by fault-free byte slk^ contammg tnesesyiMJirosizers can be out 
of phase by £ + r. Under these conditions* phase difference among signal in the same batch can 
grow without bound along feedback paths. 

The interested reader is referred to [17] for a more detailed discussion on synchronization 
voting. 

D 
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Suppose an encoding scheme is picked for a redundant packet communication system under 
which 2f + 1 byte slices is sufficient For detecting and/or masking failures in up to f slices among 
mem. Timing synchronization can be maintained in such a system by using 2f + 1 byte slices in each 
redundant module, and connecting the control modules in these 2f+l byte slices as shown in Fig. 
2.3. Each synchronization set in this configuration has 2f+l control modules. We have already 
noted that the synchronizer algorithm can be designed to maintain timing synchronization using a 
total of 2f+l input signals. To satisfy the requirement of having at least 3f+l synchronizers 
participating in each round of synchronization activities, we can add f synchronizers to each 
synchronization set, connecting them to every control module in mat synchronization set and to each 
other. These extra synchronizers do not receive input signals nor produce output signals. Their sole 
function is to enhance robustness in the synchronization algorithm, by relaying synchronization 
signals among all synchronizers. The algorithm they execute is: 

Receive signals from f + 1 distinct synchronizers, then send a signal to every synchronizer. 

Finally, note that the steps in the algorithm for reaching agreement among synchronizing 
merge modules does not depend on the contents of the messages exchanged. The algorithm can also 
be used for reaching agreement among a group of modules making multiple-outcome decisions 
independently. In particular the consistency maintenance algorithm can be used to exchange clock 
readings among computer systems reliably, and hence for timing synchronization. This approach is 
clearly more expensive than our timing synchronization algorithm for managing redundancy in 
hardware systems, but it may be appropriate for synchronizing nodes in a computer network. The 
fault tolerance capabilities of mis approach also depends on how hardware failures can be handled in 
an implementation. 
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4. Asynchronous Packet Communication Protocols and Fault Models 

In the last two chapters we have explained our approach to constructing fault-tolerant packet 
communication systems. A redundant system is constructed by interconnecting redundant modules 
via redundant links. Each redundant module has a byte-sliced internal structure. Timing 
synchronization and consistency are always maintained, even in the presence of failures, in a 
redundant system. The output words of a byte-sliced module are encoded for detecting and/or 
masking failures in its byte slices. The emphasis in the last two chapters has been on explaining the 
problems of timing synchronization and consistency maintenance, the approach we have taken to 
sustain them, and the basic algorithms used in this approach. 

In this and the next chapter we study hardware implementation issues. Since die effectiveness 
of our approach to fault tolerance in packet communication systems depends ultimately on our ability 
to implement the redundancy management algorithms given in Chapter 3, we will analyze 
implementation issues in detail. The bask concepts for control module design and analysis are 
presented in this chapter. Hardware implementation of control modules and synchronizing merge 
modules to support our particular approach to constructing fault-tolerant systems are studied m die 
next chapter. 

A control module receives and generates packets and acknowledgments by interpreting and 
generating signal transitions on its input and output fines according to a set of conventions called an 
asynchronous packet communication protocol Without knowledge of the adopted communication 
protocol, there is no basis for discriminating between normal and faulty behavior. We define a class 
of packet communication protocols in Section 4.1 whose use will be assumed in studying hardware 
implementation of control modules and synchronizing merge modules. 
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In Section 4.2 we present fault models for characterizing the behavior of faulty hardware 
modules. We describe the type of output signals mat we assume a faulty module may generate on its 
output lines. The interaction between fault-free and failed modules is characterized formally by 
modeling the interaction between signals that may be generated by failed modules and hardware 
elements used to construct fault handlers. This latter aspect of fault modeling is essential for 
evaluating the effectiveness of fault handler designs under the assumed fault model, but is often left 
out under the stuck-at fault model. 

For studying fault tolerance problems in packet communication systems, we feel that the 
stuck-at fault model is too restrictive, and have chosen to study hardware implementation issues using 
more general fault models which better reflect the sensitivity of self-tuned hardware modules to runt 
pulses and output hazards. Hardware implementation of the control modules and synchronizing 
merge modules will be studied under the stuck-at fault model, the random pulse train fault model and 
the random wave train fault model. These latter two more general fault models are also motivated by 
failure mechanisms in VLSI technologies which are not adequately modeled by the stuck-at fault 
model, Interaction between faulty signals generated under these two fault models and hardware 
elements used to construct control modules and synchronizing merge modules are also specified. 

4.1 Asynchronous Packet Communication Protocols 

We have described packet transmission over a channel in terms of channel state transitions 
(Fig. 2.1). Both data and control information are sent over a channel. A packet communication 
protocol is a convention for interpreting the signal sequences transmitted over the wires of a channel 
to synchronize port activities. A packet communication protocol is asynchronous if the state of a 
channel can always be deduced by examining the signals carried on the wires in the channel, without 
consulting any external timing reference. When an asynchronous packet communication protocol is 
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adopted, a channel can reside in a state for an arbitrarily long period of time. In other words, an 
input port need not process an input packet within a fixed time interval after die packet is available 
and an output port need not generate output packets at a predetermined fixed rate. A packet 
communication module whose port activities are synchronized by asynchronous protocols is then, in 
this sense, a self-timed hardware module. 

An acknowledgment bundle can be implemented using a single, wire. Acknowledgments are 

delivered on this wire as signal transitions, as shown in Fig. 1.1b. For packet transmission on an 

« 

n-wire packet bundle, each packet is represented by an n-bit binary suing. Suppose a packet 
represented by string a is delivered to an input port, acknowledged, and then a packet represented by 
string b is transmitted. The packet bundle connected to the input port may undergo a series of 
intermediate state changes before settling down at b. The set of possible intermediate states that may 
occur when a packet bundle undergoes a state transition from a to b is characterized by the subspace 
covered by a and b, defined as foBows: 

Let a = aj...^ and b = bj...b n be two n-bit binary strings. The se| of n-bit binary strings in 
the subspace covered by a and b, is defined m 

{c = Cj..x b |cj = aj = bj for alii at which 8j = bj} 
The subspace covered by 00100 and 01110 is, for example, 

{0x 1 ly 1 0|x 1 €{0,l},y 1 €{0.1}} 

To facilitate the design of input ports, we use an approach in which a stream of packets is 
transmitted on an n-wire packet bundle by encoding them alternately in two subsets, A and B, of die 
binary n-cube, and choose A and B so that transitions between them is easily recognizable. More 
specifically, A and B should be chosen so that an input port can recognize when a state transition on 
its input bundle, from any element in one set to any element in the other set, is completed. Our 
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approach is to choose A and B so that when a packet bundle undergoes a state transition from any 
element in one set to any element in the other set, no element in either A or B can ever occur as an 
intermediate bundle state. Any two sets of binary strings with this property are said to be separable, 
as defined next 

Consider an n-wire packet bundle. Let A and B be two subsets of the binary n-cube. A and B 
are inseparable if there exist three distinct strings a € A, b € B and c € A U B such that c is in the 
subspace covered by a and b. A and B are separable if they are disjoint and not inseparable. 

In other words. A and B are inseparable if there are elements a in A and b in B such that in a 
state transition on a packet bundle from a to b or from b to a, an element in A or an element in B may 
appear as an intermediate state. The two sets { 010, 111 } and { 100, 001 } are thus separable while 
{ 000, 010, 111 } and { 100, 001 } are not When two separable sets are used for packet transmission 
the receiving input port can detect the completion of the next packet transmission, after a packet 
encoded in either A or B is acknowledged, by looking for a string in the other set on the packet 
bundle. 

For an asynchronous protocol based on two separable sets A and B, a packet bundle state is 
illegal if it is not in the subspace covered by A andB. 1 It s unstable if it is legal but is in neither A 
norB. Elements of A and Bare called stable states. 

The ready/acknowledge handshake protocol is an example of an asynchronous packet 
communication protocol. The** o separable sets are the singletons { } and { 1 }, used on the ready 
wire (Fig. 1.1a) to control packet transmission. Packet contents are delivered on the data wires. 



1. The subspace covered by two sets A and B is: 

U { subspace covered by a andb|a€ A, b € B } 



Another example of an asynchronous packet protocol b die daaHaS protocol. InaduaHraS 
protocol every information tit in a pactet B transmit ted ever a pair of wires. A packet bundle thus 
cons&s of a wire pairs, kfema jaor^statewlteBaBwhTscarfydK togkvahieO, It is hi a dais state 
when exactly one wire at each pair carries the logic value 1. The singleton set, consisting of only the 
spacer, and &e set of a* data states provide two separable sets for asynchronous packet transmission. 
A packet s transmitted over the bundle m a spacer -» data transition. A data -* spacer transition 
resets the bundle state for die next transmission. Under die dual-rail protocol, a packet bundle state 
is illegal if some wire pair carries 1 on both wires. Data states and the spacer state are stable legal 
states. Other legal states, in which some wire pair carries "00", are unstable. Channet state transitions 
for sending Mat packets in dual-rail is iBustrated in Fig. 41. The two separable sets used m a 1-bit 
dual-rail protocol are {«>, and {01. 10}. 
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Other examples of asynchronous packet communication protocols can be found in [4]. 

An input port can detect the arrival of a new packet using a packet detector (Fig. 4.2a). A 
detector is constructed out of acceptors for two separable sets A and B, and a Muller C-elemenL The 
output of the C-element keeps its present value as long as its two input values disagree, and changes 
to the common input value otherwise. Operation of the C-element is described by a transition 
diagram in Fig. 4.2 (b). A total state of the C-element specifics the signal values carried on its input 
and output wires. The unstable states of a C-element are marked by an asterisk in the transition 
diagram. A C-element acts by making transitions from unstable to stable states. An acceptor for a set 
S is a hazard-free combinational circuit whose output is 1 if and only if its input is in S. The output 
of a packet detector undergoes -* 1 and 1 -* transitions as the monitored bundle makes 
transitions between the two separable sets. A packet detector for dual-rail protocols is shown in Fig. 
4.2 (c). Packet detectors in synchronizing decoders are called byte detectors because each packet 
received by these decoders contains exactly one byte of each input word. 

Further restrictions, based on the notion of Hamming distance [46], [65], for example, may be 
imposed on separable sets used in packet communication to support error checking and error 
correction. We next illustrate a technique that can be applied to translate familiar error detecting or 
error-correcting codes [46], [65] into a dual-rail code with the following example. 

An n-bit dual-rail parity code partitions the data states of a packet bundle according to their 
parity. Let the dual-rail encoding of the information bits and 1 be "01" and "10", respectively. A 
data state in a dual-rail protocol has even parity if it contains an even number of "10"'s, and odd 
parity otherwise. An even-parity dual-rail asynchronous packet communication protocol is just a 
dual-rail protocol whose data states all have even parity. In using a dual-rail even parity code, even 
parity packets alternate with spacers on packet bundles. 
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If a dual-rail parity code is used to detect failures in a redundant module, each dual-rail bit 
should be generated by an independent slice to assure that common failure modes, such as single slice 
failures, affect at most one dual-rail bit Encoding and decoding packets for fault tolerance are 
further discussed in Section 5.2. 

4.2 Fault Modeling 

We are interested in designing control modules and synchronizing merge modules mat can 
tolerate hardware failures in their neighbors in maintaining synchronization and consistency among 
byte slices in a redundant module, and detect and mask data errors. To study these designs 
vigorously, wc need to explicitly specify the interaction between signals generated by failed modules 
and hardware elements such as combinational gates, C-elements and latches, used in constructing 
fault handlers. We call such a specification a fauft model. In using a given fault model in fault 
handler design, we arc assuming that under the most common hardware failure modes the underlying 
physical interaction between signals generated bj failed modules and hardware elements in fault 
handlers can indeed be characterized by the adopted fault model 

In this section we introduce the stuck-at fault model, the random pulse train fault model and 
the random wave train fault model. These latter two models are generalizations of the widely used 
stuck-at fault model. Several failure mechanisms m VUS tfMOS and CMOS technologies that 
cannot be modeled by classical stuck-at fault models are reported in [28], [63]. Some of these 
mechanisms are aging processes which modify me electrical characteristics of basic transistor circuits. 
It seems mat as feature sizes, path widths and separation between active circuits reach submicron 
levels, the stuck-at fault model will not be adequate for modeling many on-chip failure mechanisms, 
or transient failures caused by external interference. 
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We note that the stuck -at fauk model, the random pulse tram fault mode! and die random wave 
train fault mode) form a strict hierarchy in terms of generality and modeling power, the random wave 
train model being the most general, the stuck-at model the most restrictive. In the next chapter we 
study the design of control modules and synchronizing merge modules using these fault models. For 
a specific implementation technology, a fault model can be validated cither through experimental 
measurements or physics modeling. Validation of these fault models for available implementation 
technologies is beyond the scope of that thews. 

The Stuck- At Fault M«kl 

Under this fault model a signal generated by a railed module appears to a hardware element in 
a fault-free module as if the signal is stuck at ejther the lqgjc level or the logic level 1. The behavior 
of a hardware element receiving a faulty signal is identical to that of the element under fault-free 
operation, with its corresponding input tied to either or 1. This is die most common fault model 
assumed in studying fault-tolerant digital systems, and is often used to characterize logic gate failures 
caused by output lines shorting to ground or V^. This fault model is used in many production 
environments, semiconductor manufacturers, for example, for test generation. 

A byte slice in a redundant module communicates with its neighbors via an asynchronous 
packet communication protocol. Our stuck-at fault model for byte slices states that a fafled byte-slice 
behaves exactiy like the fault-free slice except that some of its output lines are either ORed with Vqq 
or ANDed with ground (Fig. 4J). Thus while some output lines of a faikd byte slice are stuck, 
signals may still be generated on its other output fines in accordance with the adopted 
communication protocol. Wc will also explicitiy assume that an output line of a failed byte slice can 
become stuck at a logical value only when the output signal on that fine is already residing at that 
value, and thus no spurious signal transition is ever generated. Under die stuck-at fault model in Fig. 
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Fig. 4.3. The stuck-at fault model for byte slice failures. 



4.3, an output byte or acknowledgment generated by the failed slice can be delivered to some, but not 
necessary all, neighboring control modules. An acknowledgment wire stuck at cither or 1 will stop 
returning acknowledgments. Kit if some wires in a packet bundle (Fig. 2.1) are stuck, some packets 
generated by the failed slice may still be delivered on that bundle. These situations will be discussed 
in more detail when we study fault handler designs under the stuck-at fault model in Section 5.1 2. 



The Random Puke Train Fault Model 



Under this feult model signals generated by a failed module may oscillate randomly between 
and 1, but do not reside at intermediate levels for any significant period of time (compared to logic 
gate delays). We model such a faulty signal as a random pulse train (Fig. 4.4a). The interaction 
between a random pulse train and a fault-free hardware element is as follows: 
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(a) A random pulse turn. 
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(b) Interaction between a random pulse train and an OR gate. 



Fig. 4.4. The random pulse train fault model 
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Fig. 4.4. The random pulse train fault model (continued). 
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Interaction between a combinational gate and a random pulse train 

A combinational gate, such as an AND gate or an OR gate, is modeled as a logic gate whose 
output at time t is obtained by applying the corresponding boolean function to its input value at time 
t, followed by a delay clement whose input signal is delayed by some given S. Interaction between a 
combinational gate and a random pulse train input is illustrated using an OR gate in Fig. 4.4b. 

Interaction between a C-clcment and a random puke tram 

The input/output behavior or a Muller C-element is illustrated hi Fig. 4.2. More precisely, we 
assume that the state table in Fig 4.2b specifics the output of the C-element at time t given its input 
and output signal values at time (t-fi) for some fixed 6. Interaction between a C-clcment and a 
random pulse train input is illustrated in Fig. 4.4c. In modeling this interaction we have made the 
assumption that signal transitions in a random pulse train are sufficiently far apart that a C-element, 
when activated, will always settle down in its new state before the next transition occurs, and that a 
random pulse train does not drive a C-elcmeot into a metastable state. Again, we caution the reader 
that the validity of these assumptions should be carefully, checked before adopting this fault model 
for specific hardware implementation. Our key application for Celements in control modules design 
under this fault model is for filtering out random pulses using a fault-free input signal, as illustrated 
in Fig. 4.4c. 

Interaction between a latch and a random nuke tram 

For a latch receiving a random puke train under the control of a fault-free latching pulse, we 
assume that the output of the latch will settle down at either or 1 within a fixed time interval after 
the latching puke. It is well known that if the input to a latch changes its value at about the same 
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time the latching pulse arrives, the latch may be driven into a metastatic state and remain in that state 
for an arbitrarily long time. For many technologies the probability that a latch, after entering its 
mctastable state at timet remains in that state at t+5 decreases rapidly with S. It is thus possible to 
improve the accuracy of the random pulse train fault model by assuming a longer settling time 
between delivering a latching pulse and reading the output of the latch. The probability that the 
latch will remain in its mctastable state after the assumed settling time is, however, nonzero and must 
be taken into account in calculating reliability measures. 

The Random Wave Train Fault Model 

Under this fault model signals generated by a failed module can wander arbitrarily in die 
region bounded by the signal values and 1. We model such a faulty signal asa random wave train 
(Fig. 4.5a). The interaction between a random wave train and a fault-free hardware element is as 
follows: 

Interaction between a combinational gate and a random wavt train 

Same as that between a combinational gate and a random patse train. Interaction between a 
combinational gate and a random wave train input is illustrated using an OR gate in Fig. 4.5b. 

Interaction be twwn a Ottancrtt and a random wave train 

We assume that a random wave train input can be propagated to the output of the C-element, 
as illustrated in Fig. 4.5c. Thus under the random wave train fault model, C-elcments are no longer 
useful for filtering faulty signals. 
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(a) A random wave train. 



E> 



Out 



1 ~ 




Out 




V^ 



(b) Interaction between a random wave train and an OR gate. 



Fig. 4.S. The random wave train fault model 



87 




Out 




Out 




(c) Interaction between a random wave train and a C-elemenL 



Fig. 4.S. The random wave train fault model (continued). 



Interaction between a latch and a random wave train 



Same as that between a latch and a random pulse train. 



43 Discussion 

A packet communication module interacts with its neighbors at several different levels. At the 
wire level it delivers signal transitions on wires to its neighbors. At the packet communication level it 
delivers packets to its neighbors over a bundle of wires and delivers acknowledgments over 
acknowledgment wires, in accordance with the adopted packet communication protocol. Higher 
levels of interaction, interpreting a packet as a request for service, for example, need not concern us 
here. The three fault models we have given specify die behavior of a failed module at the wire level. 
In the next chapter we will relate these faults on wire bundles to packet communication errors and 
discuss how these faults relate to the abstract fault models assumed in the redundancy management 
algorithms presented in Chapter 3. 

The generation and propagation of runt pulses under the random pulse train and random wave 
train fault models can be illustrated by a simple example. Consider die synchronous digital system 
shown in Fig 4.6. The outputs of the processing module are protected by an error-detecting code, 
and checked by an error detector. These modules are synchronized usiog a two-phase clock. At ?} a 
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Fig 4.6. An error detection scheme for synchronous systems. 
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ncw input for the processing module is stored in its input latch. Hie corresponding output is stored 
in its output latch at q^- This output is checked for errors and the error indicator flag is set at the 
immediately following qpi- Suppose the processing module has failed and generates random wave 
trains or random pulse trains at its output port, violating the setup and holding time requirements of 
its output latch for synchronization with qp^ Under these conditions the output latch of the 
processing module may enter a metastatic state, as explained in Section 1.2.3, remain in this state for 
an arbitrarily long time, and then exit into a stable state that violates the error-detecting code and 
hence is erroneous. If this transition occurs late in the clock cycle following qp^ mc output of the 
error detector may not indicate an error at the next f» j. In particular, the output signal of the error 
detector may still be residing at an immediate value between and 1 when the next qpj occurs, and 
can be interpreted as indicating an absence of errors when latched into the indicator flag. In this way 
an erroneous output may be propagated from the faulty processing module to other hardware 
modules. This sort of pathological behavior cannot occur under the stuck-at fault model. 
Redundancy techniques for eliminating such pathological behavior in redundant packet 
communication systems are discussed in the next chapter. 

Regarding packet communication protocols, we have presented a class of asynchronous 
protocols in Section 4.1 that is sufficiently general for implementing packet communication systems, 
and is yet still quite easy to characterize. It is possible to construct more general classes of protocols 
but their implementation may require more powerful and more complex encoding and decoding 
equipment Separable sets can also be used to generalize the notion of an acknowledgment, to 
construct systems in which packets are transmitted in both directions over a single channel. The 
applicability of these generalizations remains to be investigated. It would also be of interest to 
compute the maximum number of distinct packets mat can be represented using separable sets of 
binary strings of a given length. This problem has been studied in [5J. 
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5. Control Module and SynchfBMzkg Merge Modtile Design 

In this chapter we study hardware implementation of control modules and synchronizing merge 
modules. While p-moduks other man synchronizing merge modules can be implemented under 
unbounded gate delay assumptions [4], [31J, [33], [39], we haw found it convenient to design 
synchronizers, decoders and synchronizing merge modules using more conventional asynchronous 
state machine design techniques. Delay elements are inserted on selected paths in each control 
module to ensure that inputs to these asynchronous state machines indeed conform to assumptions 
made on their rate of change. Upper bounds on signal propagation delays among synchronizing 
merge modules in a redundant module are also used to implement tone-out mechanisms in the 
consistency maintenance algorithm. Note that all such timing considerations are confined to within 
the same redundant module, and hence each redundant module can still be constructed without 
detailed knowledge of the internal operating speed of its neighbors, other than upper bounds on 
phase differences among packets and acknowledgments in the same batch received by die redundant 
module from these neighbors. 

Hardware implementation of control modules and synchronizing merge modules is studied in 
Sections 5.1, 5.2 and S.3, using the fault models given in Section 4X In Section S.1 we study 
implementation of die timing synchronization algorithm in die synchronizer section of control 
modules. In Section 52 we examine techniques for decodmg packet contents m ok decoder section 
of synchronizing decoders. General implementation techniques for these decoders are discussed, 
without going into detailed case studies based on specific encoding schemes or decoder designs. In 
Section S3 we study implementation of the consistency maintenance algorithm in synchronizing 
merge modules. Our approach to constructing redundant packet communication modules and die 
implementation techniques explained in this chapter are illustrated by two design examples in 
Section 5.4. 
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Throughout Section S.l, 5.2 and 5.3, we assume that tiiere are 3f+l byte slices in each 
redundant module, up to f of which may fail. There wffl «us be enough failure-independent byte 
slices to maintain synchronization and consistency using the algorithms presented in Chapter 3 under 
the abstract fault model presented therein. Hardware implementation of these algorithms, and their 
use in maintaining timing synchronization and consistency in packet communication systems 
subjected to failures modeled as in Section 4.2, are presented under this assumption. Application of 
our redundancy techniques to hardware modules which are not byte-sliced internally is illustrated in 
the design examples given in Section 5.4. 

5.1 Synchronizer Implementation 

Let us briefly review the operational environment of a synchronizer. Referring to Fig. 23, a 
synchronizing decoder receives input bytes from a synchronization set of 3f+l fanout modules. 
Each byte is delivered by a fault-free fanout module as a packet to me synchronizing decoder using 
the adopted packet communication protocol. The arrival of a sew byte at the synchronizing decoder 
is detected by a byte detector (Fig. 3.1a) and signaled to its synchronizer. In a fanout module, a 
synchronizer receives acknowledgment signals from its ne^hboring synchronizing decoders directly. 
The algorithm to be implemented in tile synchronizer b described in Section 3.1. For every batch of 
input signals received, a synchronizer exchanges synchronization signals with other synchronizers, 
and then generates an output signal. 

In the control modules environment, due to die use of handshake protocols, there is a feedback 
path from the output of the synchronizer to its signal sources, such that a fault-free input signal 
source S of tiie synchroniser will not send another signal to the synchronizer until the synchronizer 
has generated an output signal in response to the previous signal generated by S. A delay element 
can be inserted in this feedback path, in die control module containing the synchronizer, to regulate 
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the rate at which these input signal sources generate signals to die synchronizer. As we shall see in 
Sectkn 5.1.1, this provides a convenient mechanism for ensuring the proper operation of 
asynchronous state machines used in die synchronizer. 

For the synchronizer implementations studied in this section, a signal is implemented as a 
signal transition, either -* 1 or 1 -* 0, onlhe corresponding wire. We furthermore adopt the 
convention that input signals in the same batch, and die output signal generated by the synchronizer 
in response to that batch, arc all represented by signal transitions in die same direction. In the 
following discussion wc need to distinguish between die logical signals received and sent by a module 
and the signal waveforms carried on its input and output wires* We call the former hgkelsiptab and 
use die term signal to denote the latter. 

For each batch of logical input signals received by a syndmmizer, at least 2f+l of them will be 
signal transitions generated by fault-free byte slices, and wffl be in-phase within some known S. 
Failed shoes can exhibit pathological behavior, and we wffl onfy consider die sort of pathological 
behavior allowed in our fault models. Under these input conditions, a har dw ar e implementation of a 
synchronizer must support (SI), (S2) given in Section 3.1, and (S3): 

(51) For each new batch of input signals received, aD fault-free synchronizers in the same 
synchronization set will generate output signals within a faed tune mterval of each other. The 
duration of this interval is bounded by a constant which can be calculated from the time it takes 
a synchronizer to perform basic operations, and each bask operation can be p erfor m ed in a 
fixed time. 

(52) If input signals in die same batch are generated by fault-free neighboring slices within a fixed 
known time mterval of each other, then for each new input batch, a synchronizer wffl generate 
its output signal only after tt has received at signals in that batch geaented by fauk-free shoes. 
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(S3) The output signal generated by a fault-free synchronizer must not contain hazards or runt 
pulses. 

The last property (S3) is especially important in self-timed hardware systems whose modules react to 
signal transitions on their input wires. AH these three properties must be satisfied together in a 
synchronizer implementation to support our overall approach to timing synchronization. We first 
present a synchronizer design which implements the synchronization algorithm under fault-free 
operation, and men examine how this design can be enhanced to deal with hardware failures 
modeled by the three fault models explained hi Section4.2. 

5.1.1 Synchronizer Implementation under Fault-Free Conditions 

The basic operations in a synchronizer are to generate logical synchronization signals and 
logical output signals in response to die receipt of logical input signals or logical synchronization 
signals. These bask operations have die form: 

"Receive logical signals on n of the 3f + 1 input fines, then generate a logical signal." 
In an implementation, the logical signals received and generated by each such operation are all 
represented by signal transitions in the same direction. We first design * generator ckaiil (Fig. 5.1a), 
denoted by gen{n, if+1% for performing this operation. During fault-free operation, the input and 
output signals of die generator circuit undergo the following cycle: 

(gl) All fault-free inputs are at 0. 

(g2) n or more of the inputs become 1, output is then set to 1. This corresponds to receiving a 

batch of logical input signals and generating a logical output signal in the synchronization 

algorithm. 
(g3) All fault-free inputs are at 1. 
(g4) n or more of die inputs become 0, the output is then set to 0. This corresponds to 
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receiving a second batch of logical input signals and generating a second logical output 
signal in the synchronization algorithm. 

In the context of control modules, due to the use of handshake protocols for packet 
communication, we can also control die duration at which the input signals to the synchronizer 
remain at input state gl or g3. This can be achieved by delaying the acknowledgment returned by the 
control module containing the synchronizer to its predecessor. The kind of control is important 
because several asynchronous state machines are used in die synchronizer implementation, and their 
outputs would be hazard-free only if their internal state variables are given sufficient time to stabilize 
in between input changes. The rate of input change is controlled by holding the input states of a 
generator ingl and g3 for as long as is necessary. 

A generator circuit genin, 3f + 1) (Fig. 5.1a) is constructed using two threshold circuits and an 
asynchronous state machine which we will call a sequencer module. 

A threshold circuit (Fig. 5.1b), denoted by f/uA (m, 3f+U is used to delect the arrival of m 
logical signals, b is constructed ftom Q$&, &*i)\ trees of AND gates and a tree of OR gates. Its 
output is 1 if and only if m or more of to 3f+ 1 input tignak treat 1, and it free of hazards under die 
following operating < 



As the input state to a threshold circuit cycles from gl through g4, a fault-free input 
signal will not make another transition until aftg the output signal transition generated m 
response to the previous transition on that input signal hasbeen observed on die output 
line of the threshold circuit 



1. C(m, 3f + 1) is the number of combinations of choosing m objects out of 3f+ 1 distinct ones. 
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Since these threshold circuits are used aside control modules, this con<fitk» is automatically 
satisfied by using asynchronous handshake protocols for packet commtnutation in die har d ware 



h die geoerab)r0^5JaXtheilts^n,3r+l)chcuit detects die inrxit state transtkms from gl 
to g2, the rtrt(3f-n+2, 3f+l) circuit detects the input state transitions from g3 to g4. The output 
signals of these twotftreshold efcments cycle through die fbOowingi 



(tl> ifa*(n,3f+l) = 0, /Aa*(3f-n+2.3f+l) = 0. (generator inputs at gl) 
(t2) ifaA(n, 3f + 1) = 1, ifcrt(3f- n+2, 3f + 1) = - , (generator inputs changes from gl tog2) 
(t3) tlattn, 3f+l) = 1. tfsa(3f-n+2, 3f+l) = 1. (generator inputs atg3) 
(t4) thsttn, 3f+I) = -, tksH3f-n+2, 3f+l) = 0. (generator inputs changes from g3 to g4) 
and dwn hack to tl when die raput unsafe to die generator change tram gf back togl. 



Under the signal representation conventions, Ae output signal of die generator should go from 
to 1 when die input signals of die generator go from gl to g2 or, cqumfendy, when die output 
sigtialaftherJtresheitfeiente^ Sndhs%dto output signal of the 

generator niidcrnoeialtoO transition np^ 

or, cquivakntly, a 1 to transition on die output signal of the threshold dement t&sA(3f-tt+2, 
3t+l\ 

The output of the generator circuct is derived from die outputs of the two threshold circufts 
using die asynchronous state machine, which we wslcatt a sequencer module, shown hi Rg. 5Jc The 
sequencer module has a single binary state variable, knpkmented widt a C-efcmenL When both 
inputs II and 12 are 0, the output of die C-eJanent is and die next -+ 1 transition on II is 
transmitted to the output Bne Out. When both inputs are l,fte output of die C-etemeni island the 
next 1 -* transition on 12 is transmitted to Out. The output signal defivcred at Out by a sequencer 
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module is hazard-free under a delay assumption and an input assumption: 



(1) If both input signals to the demultiplexer (Fig. 5.1c) have the same value as its output signal, then 
selecting one input signal instead of the other by changing the output of the C-element will not 
cause a spurious pulse at Out. This can be assured if in the demultiplexer (Fig. 5.1c), 

delay (Not) + delay (Andl) > delay (And2) + delay (Or) 

(2) The input states remain at 00 and 11 long enough to allow the internal state of the sequencer 
module to settle down before the next input change occurs. The previously discussed 
mechanism of delaying the return of acknowledgments from a control module to its predecessor 
is useful for ensuring that this condition is satisfied. By holding the input signals of the 
generator circuit at gl and g3, the output signals of the two threshold elements, and hence the 
input signals to the sequencer module, can be held at tl (00) and t3 (11), respectively. 

□ 

Three generator circuits are used in a synchronizer (Fig. 5.2a): 
Gl: gen (f+ 1, 3f + 1), generates a logical signal after receiving f+ 1 logical input signals, 
G2: gen (f+ 1, 3f + 1), generates a logical signal after receiving f + 1 logical synchronization signals, 
G3: gen (2f+ 1, 3f + 1), generates the logical synchronizer output signal after receiving 2f + 1 logical 
synchronization signals. 

Logically the synchronizer generates a synchronization signal as soon as either Gl or G2 
generates a logical signal. Under our signal representation, this is implemented by combining the 
output signals generated by Gl and G2 using another asynchronous machine called the detector 



1. For delay considerations, a logic gate or a C-element is modeled as a hardware element, which 
reacts instantaneously to input changes according to the corresponding boolean function or transition 
diagram (Fig. 4.2b), followed by a delay element 
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Fig. 5.2. Hardware implementation of a synchronizer. 
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module (Fig. 52b). A detector module differs from a sequencer module in that it does not pass a 
-» 1 transition on one input signal and men a 1 -* transition on »e other to Hs output line. The 
detector module also has a single binary state variable, anptemented with a C-element When both 
inputs II and 12 are 0, the output of the C-etement is 0, and the next -* 1 transition on either II or 
12 is transmitted to me output Bne (hit. When both inputs are 1, the output of the C-element is 1, 
and the next 1 -♦ transition on either II or 12 is transmitted to (hit. The output signal delivered at 
Out by a detector module is hazard-free under a delay assumptions and an input assumption: 

(1) If both input signals to the demultiplexer have the same value as its output signal, then selecting 
one input signal instead of the other by changing the value of "select" from to 1 or from 1 1 
will not cause a spurious puke at Out. This can be assured if in the deraultiplexor (Fig. 52b): 

delay(Not) + deJay<And2) > delay(And3) + delay<Or2) 

(2) The input states remain at 00 and 11 long enough to allow the internal state of the detector 
module to settle down before the next input change occurs. This condition can be satisfied by 
preventing input signals and synchronization signals from fault-free modules from changing too 
rapidly, which can again be achieved by delaying toe logical acknowledgment signals returned 
by the control module containing toe synchronizer to its predecessor by an appropriate amount 
oftime. 

□ 

Finally, to satisfy (S2). the output of G3 is delayed 6 seconds (Fig. 52a), where 5 is me phase 
difference among logical input signals in the same batch delivered to the synchronizer. 

This completes the description of our synchronizer design. Let us examine the operational 
characteristics of this design during fault-free operation and the support it provides for our overall 
approach to constructing fault-tolerant packet systems in more detail. We have noted that a 
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synchronizer implementation must satisfy properties (SI). (S2) and (S3) together. (SI) is satisfied 
since the basic operations in the synchronizer are implemented using hardware elements such as 
C-elemcnts and logic gates, each of which has a fixed, bounded, reaction time under ail possible input 
conditions presented to them in the synchronizer during fault-free operation. (S2) is satisfied by 
delaying die output of G3 by 6. (S3) is achieved by designing die threshold circuits, die sequencer 
modules and the detector module to be hazard-free under these input conditions. For system 
integration, the phase difference among packets and acknowledgments « ihe same batch delivered to 
a control module is used to determine the proper delay element for delaying the logical output signals 
generated by G3. 

To analyze die fault tolerance capability of die synchronizer implementation, let us define an 
operation cycle of a synchronizer as the period between die time when aft fault-free input signals to 
the synchronizer are at die same logic lever to the next time when afl these fault-free input signals 
have switched to the complementary logic level. Due to die use of asynchronous packet 
communication protocols, operation cycles can vary arbitrarily in length. Proper synchronization 
assures only that all fault-free input signals wffl change state together within die test « time units of 
each operation cycle, for some fixed A. WecanchaneteriwdwrestrirtkmOTmput 
by failed modules imposed by die abstract fault model presented in Section 3.1 as: 

Suppose in an operation cycle every fault-free input signal from a group of 
neighboring control modules, or from a group of synchronizers m die same 
synchronization set, makes a transition from one logic value a to the other logic value ~a 
Then an input signal generated by a failed module in that group, either stays at a for die 
entire operation cycle, or makes die same transition exactly once and then stays at -a in 
that cycle. 
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Under this restriction on input signals delivered by failed modules, (SI), (S2) and (S3) can be 
maintained using the synchronizer implementation presented above. We will show that under the 
stuck-at fault model, the output signals generated by failed modules indeed obey this restriction, and 
hence the above synchronizer can be used directly to maintain timing synchronization under this 
fault model. Under the random pulse train fault model, we show that we can filter the output signals 
generated by failed modules to derive signals which obey this restriction, and then feed these signals 
to the synchronizer implementation presented above to maintain timing synchronization. An 
approach to tolerating failures under the random wave train fault model is discussed in Section 5.1.4. 

5.1.2 Synchronizer Implementation under the Stuck-At Fault Model 

A synchronizer in a fanout module receives acknowledgments directly from its neighboring 
slices and will stop receiving acknowledgments on any input line which is stuck at either or 1. Thus 
a neighboring slice which has failed will appear to a synchronizer in a fanout module as if it has 
stopped sending acknowledgments. 

A synchronizer in a synchronizing decoder receives signal transitions from its neighboring slices 
indirectly through byte detectors monitoring the output packet bundles of these slices. Suppose a 
failed byte slice attempts to set its output packet bundle to a certain state to transmit a new byte. If 
some wires in its output packet bundle are stuck at either or 1, then states which may appear on the 
bundle are restricted to those with the corresponding bits fixed at the corresponding values. If the 
failed byte slice attempts to set its output packet bundle to a state which is not in this restricted 
subset, the desired transition will not be observed by the byte detector monitoring the packet bundle 
and will not cause a signal transition on the output of the byte detector. Later on, the failed byte slice 
may be successful in delivering another packet- encoded by a packet bundle state in the restricted 
subset The arrival of this new packet is then detected by the byte detector and signaled to the 
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synchronizer. The stuck-at fault mode! for packet conuuunkation modules (Fig, 43) also i 

that a failed slice behaves just Bkc *e faufc-free slice, escept tkot souk ofiB output lines are stuck at 

either or 1. Thus a fatted sfiee wfli attempt to delves a new byte, using the adopted protocol, only 

when an fauk-free slices in flic same i rdun d ant mo^ 

8. 

These fault phenomena can be illustrated using die byte detector designed to receive packets in 
dual-rail (Rg. 4.2). If both fines in an input wire pair to die byte detector are stuck, at either or 1, or 
oneoftrte wires in a wire pair is stuck at 1, the output of die byte detector wul also become stock at 
some logical value, either or L if one of Me wires beconKS stack at flt padtet coMwiunication can 
continue until a dual-rail bit encoded by a 1 on the wire suffering die stuck-at fault and a on the 
other wire in the wire pair is transmitted. The wire pair wifi then stay m the spacer state, and the byte 
detector will not signal die arrival of die new byte. The corresponding ok of a subsequent packet 
may, however, be encoded by a on the wire stuck at 0, and a 1 on the odier wire ia die ware pair. 
The arrival of this new packet is then be detected by die byte detector and signaled to die 
synchronizer. 

Thus under the stuck-at fault model, Me input signals dehvered to a synchronizer satisfy the 
restriction explained at the end of Section 5.1.1. The synchronizer design presented in Section 5X1 
can thus be used to construct control modules which can maatam properties (SIX (S2) and (S3) in a 
synchronization set. 
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5.13 Synchronizer Implementation under the Random Pulse Train Fault Model 

Under the random puke train fault model (Fig. 4.5), a random pulse train signal generated by a 
failed byte slice neighbor can propagate through a byte detector and appear on the corresponding 
input line of the synchronizer. If mis signal is processed by the synchronizer presented in Section 
5.1.1 directly, property S3 can be violated and the synchronizer can deliver random pulse train 
outputs, according to the assumed fault model for interaction between random pulse trains and 
hardware elements in the synchronizer. Under mis fault model, however, a random pulse train can 
be "filtered" by passing it through a C-element whose output is regulated by a fault-free reference 
signal (Fig. 4 Sc). 

In a control module, we can derive a reference signal for this purpose from either an 
acknowledgment bundle (in synchronizing decoders) or a packet bundle (in fanout modules) in the 
control module containing the synchronizer. For synchronizer design we can assume that these 
signals are free of errors and obey the adopted asynchronous handshake protocol. This is because a 
byte slice is taken as a unk for fault tolerance considerations and in a fault-free slice synchronizing 
decoders and fanout modules receive acknowledgments and packets, respectively, from the p-module 
in the same slice. These signals can be used a* reference signals because lor the class of packet 
protocols presented in Section 4.1, acknowledgments and packets always alternate in a channel 
operated under fault-free conditions. We next illustrate the application of these ideas in a 
synchronizing decoder. 

For a synchronizing decoder, our strategy is to filter die signals generated by the byte detectors 
and by other synchronizers in die same synchronization set with a column of C-elements (Fig. 5.3a) 
regulated by the acknowledgment signal received from the p~module inutile same shce, before feeding 
these signals to the synchronizer. Conceptually we are reinforcing die packet protocol at the 



-104 



acknowledgment 



packet bundle 



packet bundle 




from other synchronizing decoders 



(a) C-etement filters in synchronizing decoders. 



packet bundle 




acknowledgment 



acknowledgment 



0^©- 



1 • • • 
from other fanout modules 



(b) C-element filters in fanout modules. 



Fu> S3. Filtering random puke train inputs in control modules. 



105 



C-element filters such that the arrival of new bytes and logical synchronization signals at die 
synchronizer alternate with the receipt of logical acknowledgment signals from the p-module. In 
terms of processing binary signals, we are imposing the following restriction on the input signals 
received by a synchronizer: 

Suppose in an operation cycle every fault-free input signal in a group of 
neighboring control modules, or in a group of synchronizers in the same synchronization 
set, makes a transition from one logic value a to the other logic value <-a. Then an input 
signal belonging to that group, generated by a faHed module, either stays at a for the 
entire operation cycle, or makes the same transition ewctly once and then stays at ~a in 
that cyele. All such transitions can occur only after die logical acknowledgment signal for 
die previous batch of transitions has been received by die synchronizing decoder 
containing the synchronizer. 

Similarly, a reference signal can be derived from the packet bundle in a fanout module 
(Fig. 5.3b) and used to filter random pulse trains on acknowledgment anes before feeding them to 
the synchronizer in the fanout module. 

Any input signal obeying mis restriction also obeys the restriction given at the end of Section 
5.1.1. A synchronization set of control modules can be consmicted usiiig synchronizers erihanced by 
these C-element filters, and properties (SIX (S2>and (S3) can be mauttaiaed in this synchronization 
set under the random pulse train fault model 
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5.1.4 Synchronizer I»pte»eirtatioii under the Randon Wave Train FauttMo<k^ 

As illustrated in Fig. 4.6, random wave trains can propagate through hardware elements in the 
synchronizer and cause runt pulses on the output signals of threshold circuits, sequencers and 
detectors, as well as on the output signal of the synchronizer itself. Random wave trains are basically 
analog signals and can he filtered with analog filters. Since sequencers and detectors are 
asynchronous state machines, it is attractive to filter the output signals of threshold circuits before 
feeding mem to Aese two types of modules. We next oudme an approach to enhance the 
synchronizer design presented in Section 5.1.1 to deal with random wave trains. Its detailed 
verifkatron iequm» aiialogeircu^ 

When all fault-free input signals to a threshold circuit are at the same signal level « the output 
of the threshold circuit will also be set at a As soon as one fault-free input signal changes its state to 
~a, it becomes possible for the faulty signals to conspire together and cause die output of the 
threshold circuit to oscillate between and 1 amf generate runt pulses. If input signal transitions 
from fink-free sources are in-phase within «, the time period during which oscillation may occur and 
the width of runt pulses will both be bounded by a. When a is known, we can buM a towpass finer 
to filter out these oscillations and runt pukes. 

Let us first consider adding towpass filters to each of 1he generator cinnrits m the syndmmizer. 
For the threshold elements m die generator cncuh Gl (F% 5^) their logical input signals are 
in-phase within ft, where ft is known when specifications b^ne^ib o ri ng redundant modules are 
given. The towpass filters on their output signals should thus he designed in filter nut all myfflatinnfi 
with period shorter than ft. For generator circuit G2, a little thought reveals that all fault-free 
synchronizers are guaranteed to generate logical synchronizer signals in the same batch within ft, 
since they will all receive at least f+1 logical input signals from fault-free input sources within ft. 
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The output signals of the threshold circuits in G2 should thus also he filtered to eliminate oscillations 
with period less than 8 using lowpass filters. For generator circuit G3, we showed in Lemma 3.1 that 
fault-free synchronizers will generate logical synchronizer signals in the same batch within 2p, where 
p is conceptually an upper bound on the time it takes for a synchronizer to generate a logical 
synchronizer signal after receiving at least f+1 logical synchronizer signals, plus the time it takes for 
this newly generated signal to reach the other synchronizers. In a synchronizer implementation, p is 
an upper bound on the delay through the generator circuit G2 and the detector module (Fig. 5.2), 
plus an upper bound on the propagation delay between synchronizers. 

To support our timing synchronization methodology, a synchronizer implementation must 
satisfy properties (SI), (S2) and (S3) given at the beginning of this section. (S2) does not concern the 
design of lowpass filters. To satisfy (SI), a signal transition, from either 0-»l or from 1 -» 0, must 
be propagated through the lowpass filter in a fixed time interval. Consider when me output signal of 
a threshold circuit used in the above contexts is at 0. Immediately before changing to 1, this signal 
may oscillate between (land 1, for up to a seconds, for some fixed a. We require mat as soon as tiiis 
signal "stabilizes" at 1, a -♦ 1 transition will be observed on the output signal of the lowpass filter 
within fi seconds, where fi h another fixed constant independent of a. This requirement also holds 
for 1 -» transitions. To satisfy (S3), the output signal of the lowpass filter should not contain runt 
pulses. 

Implementation of lowpass filters m available hardware technology, deviations exhibited by 
these implementations from die desired characteristics, and consequences of such deviations in 
maintaining tuning synchronization must be investigated before ttie effectiveness of this approach 
can be evaluated. 
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52 Decoding SecUoa Lay i cwe at atioM 

In our approach to constructing redundant packet communication systems, hardware functions 
are implemented with byte-sliced modules whose output words and acknowledgments are encoded. 
A word is encoded in mat it consists of a batch of packets, each generated by a separate byte slice and 
containing one byte of the word, and die separable sets used in the packet protocols to deliver each 
batch jointly support some error-detecting or error-correcting capability. The example of even-parity 
dual-rail protocol has been given at die end of Section 4.1 to illustrate such protocols. An 
acknowledgment generated by a byte-sliced module is encoded in that it is represented by a batch of 
acknowledgment signals, one generated by each byte slice. Hardware failures, limited to some 
maximum number of byte slices prescribed by the error-detecting or error-correcting capabilities of 
the protocol, are detected and/or masked by decoding ouqtut words and acknowledgments generated 
by a byte-sliced module. The goal of maintaining timing synchronization and consistency in a 
redundant system is to provide an accommodating environment for applying encoding techniques. 
As we have explained in previous discussions, the fault tolerance capabilities of aa encoding scheme 
can be compromised if either timing synchronization or consistency is not maintained. Only those 
failures whose occurrences do not lead to loss of synchronization or consistency can be delected 
and/or masked reliably in our approach. 

In die last section we have discussed the hardware implementation of fault-tolerant 
synchronizers which can be used to maintain timing synchronization under die stuck-at fault model, 
the random pulse tram fault model and the random wave train fault model Organization of 
decoding sections are studied in mis section. We note that the problem of "decoding" an 
acknowledgment in a fanout module (Fig. 3.1b) is taken care of using fault-tolerant synchronizers. 
We thus only need to deal with the problem of decoding input words in synchronizing decoders. We 
will explain several general techniques to deal with random pulse tram and random wave train input 
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signals in decoding sections, but will not present decoder designs for specific error-detecting or 
error-correcting protocols. 

A synchronizing decoder is organized internally into byte detectors, a synchronizer and a 
decoding section (Fig. 3.1a). Input words are delivered to a decoding section on several subbundles, 
some of which may be connected to foiled byte slices. A word is encoded to support either error 
detection or error correction. For error detection, input signals to die decoding section are monitored 
to determine whether die input word received contains any errors; To limit error propagation, 
erroneous input words are "held up" in die decoding section instead of being forwarded to 
p-modules. For error correction, input signals to a decoding section are used to regenerate die 
error-free word, which is subsequcnUy forwarded to die p-module n tiiat slice. 

In a decoding section input signals are decoded by signal decoders. We restrict our attention to 
signal decoder designs which can generate error signals or regenerate the error-free input word within 
some fixed time interval after all of Us fault- free input signals have stabilized Using signal decoders 
with this property, our implementation strategy for a decoding section is to receive a timing signal 
from the synchronizer, which indicates that all £ndt-rree input signals to the decoding section have 
stabilized, wait a fixed time interval for the output signals of die signal decoder to stabilize, and then 
use these output signals in em* detection and/or correction. 

Let us examine the requirements imposed on signal decoder designs more carefully before 
studying implementation issues. When all fault-free input signals to a signal decoder have stabilized, 
input signals from failed neighboring slices can exhibit pathological behavior according to the 
stuck-at fault model, the random pulse train fault model or die random wave train fault model. We 
require that within a fixed time interval after all fault-free input signal values are available, the output 
signals of the signal decoder must have stabilized at their appropriate values. Oscillation and runt 
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pulses are permitted, however, in the interim. Under these requirements, signal decoders must be 
designed so that runt pulses and oscillations cannot propagate through them if fault-free input signals 
have already stabilized for some thne. 

An encoding scheme for which we can design such a signal decoder is die familiar byte 
replication scheme. In this scheme each word consists of several identical bytes and die error-free 
word is derived using majority voting. A decoding section for a byte replication scheme (fig. 5.4) 
designed according to our strategy consists of a signal decoder which contains a group of majority 
voters, a delay clement $ to time the propagation defey miough die signal decoder, and a column of 
output latches to store die output values of the majority voters after ftenr output signals have 
stabilized. A majority voter is constructed out of AND and OR gates (fig. 5.4b). Under our fault 
models, a majority voter has die characteristic that if two of its input signals are at die same value, its 
output signal will stabilize at that value after a finite time period, independent of whether die third 
input signal is stuck at some logical value, delivering a random pulse train, or delivering a random 
wave train. 

Another approach to implement signal decoders with the desired property is to store input 
agnail in a cofumn oflatehe* (Fig. 5.5) and decode the outputs of titese fetches alter they have 
stabilized. In die arrangement the latches are activated by the output ^gnai of die s y n ch r o n iz er. 
Non-faulty input signals and those stuck-at eitiier or 1 win always have stabilized when the latches 
are activated by the synchronizer output signal A random pulse train or random wave train may 
cause its receiving latch to enter a mctastable state, according to the fault models in Section 42. A 
latch is allowed seconds (Fig. 5.5) to settle down, where fi is a circuit technology parameter. The 
outputs of the input latches are then decoded and stored in a column of output latches, as in a 
replication scheme. In fig. 5.5, o is an upper bound on die propagation delay through die signal 
decoder. 
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This strategy of reducing the problem of receiving random pulse trains and random wave trains 
to that of waiting for latches to exit from their metastable states can also be used to design decoding 
sections for fault detection, as illustrated in Fig. 5.6. An input word to the decoding section is first 
latched into input latches. After /? seconds, the outputs of the input latches are strobed into a column 
of internal latches. The contents of these internal latches are forwarded to the output latches of the 
decoding section only if they contain no errors. Suppose the adopted protocol is based on separable 
sets A and B. The key input property used for fault detection in the configuration shown in Fig. 5.6 is 
that if the next input word delivered to the detector should be encoded in one of the separable sets, 
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say by aj in A, then no hardware failures to be detected can cause a packet encoded in some other a2 
in A to be delivered to the detector. Otherwise a 2 will be accepted instead of aj and forwarded to the 
p-modules, defeating. the fault detection scheme. As long as mis property holds, errors can be 
confined and detected by monitoring the contents of the internal latches with a packet detector (Fig. 
4.2a) designed to recognize me separable sets A and B. After the input word is stored in the internal 
latches, the packet detector monitoring the outputs of these internal latches will generate a signal 
transition on its output Hue only if the input word stored in die internal latches is encoded in the 
proper separable set, and hence fiee of errors. If me input word stored in Die internal latches is 
erroneous, Le., not encoded m the proper separable set, no transition wift be delivered by the packet 
detector, and the erroneous word win not be stored into the output latches. 

An alarm pulse is generated from the output of the synchronizer and the output of the packet 
detector (Fig. S.6). The ptoses generated from these two signals are of bounded width during 
fault-free operation. If an erroneous or unstable input state is stored in the internal latches, the 
output of the packet detector will not change state and the width of the alarm puke becomes 
unbounded. This property is useful in designing alarm detectors. 

This fault detector design in Fig, S.6 is also applicable to delecting acknowledgment signal 
failures. The two separable states used on acknowledgiBeot bundles are the two singletons 
{ 111...111 }, and { 000...000 }. An AND gate is a receptor for the first set, an OR gate a receptor for 
die second 

Note that under die random pulse train and random wave train fault models pathological input 
conditions need not persist until detected. Such conditions may exist for a short time and then 

■■'■'* *"V- "'V. 

disappear before any fault detection mechanism has reacted to them. The fault detection scheme 
nonetheless assures, if only in a probabilistic sense due to metastable state phenomena, that no 
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erroneous packet wftl be delivered to die p-modtdes. 



S3 Implementation of the Synchronizing Merge Modvie 



With the synchronizer design and decoding section design presented in Section S.1 and 5 X we 
have a complete methodology for incorporating redundant hardware in determinate packet 
communication systems to combat hardware failures. The synchronizing merge module is introduced 
to illustrate an extension of this methodology, to a class of non-determinate systems. The basic 
non-determinate operation we consider is that of merging two input streams into a single output 
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stream (Fig. 14). In a redundant system, mis operation is performed ma redundant merge module 
using synchronizing merge modules as p-modules in its slices. 

In a redundant merge module, every input word is decoded by a synchronizing decoder before 
being forwarded to a synchronizing merge module. A slice, however, forwards only one byte of each 
received word, so that a single byte slice failure affects at most one byte of any word delivered by the 
redundant merge module. To apply encoding techniques effectively, it is necessary to ensure that 
bytes from input words received at different input ports will not be "mixed up" together in a 
redundant merge module. In the terminology introduced in Section 2.1, it is necessary to ensure that 
bytes in the same batch generated by fault-free synchronizing merge modules in the same redundant 
merge module all belong to the same input word. The consistency problem is solved by allowing 
synchronizing merge modules in a redundant merge module to communicate with each other after 
receiving an input word, to jointly determine the input port from whkh every fault-free 
synchronizing merge module will forward the next oo^puUyie to n^suci^ssorredundaflt module. 

In Section 3.2, we have presented an algorithm for exchanging messages among synchronizing 
merge modules, so that even if failed modules can send di&ient messages to different modules, 
fault-free synchronizing merge modules can still agree on which input port to service next A 
hardware structure for synchronizing merge modules and imptaacntatioa of the consistency 
maintenance algorithm are described in this section. We first present a hardware structure for 
implementing synchronizing merge modules and review fee operation of a synchronizing merge 
module using this structure. Implementation of the consistency maintenance algorithm is men 
discussed in further detail and its fault tolerance capability explained. For maintaining timing 
synchronization, p-modules in a redundant module must be peiformance cornpatibte. Performance 
compatibility among synchronization merge modules implemented using these techniques is also 
analyzed. As noted in Section 3.2, only the single failure ease, m which four synchronizing merge 
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modules are used in a redundant merge module to tolerate hardware failures in any one of diem, is 
considered. 

53.1 Hardware StracUire of a S>ncliro«uingMefteMoAile 

Packet flow at an input port of a synchronizing merge module is regulated using a forward and 
wait (FW) module (Fig. 5.7a). A FW module forwards the input packets it has received directly to an 
output cntl module. For each packet received, the FW module also generates an output request 
tagged with the corresponding input port identifier (Q or 1). These output requests are merged 
together and then processed by output cntl. A FW module will accept a new packet only after it has 
received acknowledgments for both the preceding packet, from output cntl, and for the output 
request ithas generated for the preceding packet, from the merge module. 

A merge module receives input packets at its two input pom and forwards them at its output 
port as they arrive. When a confhet arises due to two input packets arming wrdrin a short time 
interval, arbitration is performed and the two packets may be forwarded in either order. 

When a new request arrives at output cntl, its tag (which will be referred to as the request 
source) is used to invoke the decision algorithm. The input source for me next output packet is 
selected jointly with other synchronizing merge modules under this decision algorithm. If the input 
source is different from die request source proposed by ow^f oifA theconespomhng output request 
for die input source will be pending at the output of the merge module. This is because the input 
packet from die rejected request source has not yet been accepted and the corTespwMimgFWinodute 
is designed not to process a new input packet from the input port it is guarding under the 
circumstances. Thus the output request from the input source will be die only one pending at the 
merge module, and will be forwarded to die output cntl module next Ow(pw/ en// riworbs this output 
request for the input source from the merge module, accepts an input word from the FW module 
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Fig. 5.7. The Synchronizing Merge Module. 



guarding the input source, extracts one byte from it, and forwards this byte to the successor 
redundant module via a fanout module. When output cntl receives the acknowledgment for mis byte, 
it returns an acknowledgment to the FW module, and if it already has a request source that was 
rejected in the previous round of decision, it will immediately resubmit this request source. 
Otherwise it waits for a new request to arrive from the merge module before activating the decision 
algorithm again. 
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We will omit the implementation details for me FW module and the output cntl module since 
they do not illustrate new fauh tolerance tech ni q u es . 

For implementation of the consistency maintenance algorithm, request sources and input 
sources arc exchanged between output cntl and decision box under a generalized data link/control link 
protocol (Fig. 5.7b). The data link/control link protocol is explained in Chapter 1 (Fig. 1.1). Under 
the generalized data link/control fink protocol, die data fink consists of two data wires for sending 
input sources and request sources in opposite directions (Fig. 5.7b). A new request source is available 
to decision box when a ready signal is sent on the control fink. The corresponding input source is 
available on the other data wire when die corresponding acknowledgment signal is returned. 

5J.2 Implementation of the Decision Afeoritfan 

The decision algorithm in a synchronizing merge module is invoked with either Q or 1 as its 
request source. For every request source received from output cntl the following hardware operations 
arc carried out, corresponding -to Steps (i) through (v) of me decision afepridttn presented in Section 
3.2. The data path in synchronizing merge module A for carrying out these operations is shown in 
Fig. 5.8b. In this figure die symbol "Xyz" denotes die request source of X as told to Z by Y. 

(i) The request source received from output cntl is broadcast to afl decision boxes. This step is 
implemented by broadcasting die signal received on A to all derision boxes, as shown in the 
upper left hand corner of Fig. 5.8b. 

(ii) Request sources from other decision boxes are stored in latches. 

(iii) Request sources stored in latches are broadcast to all derision boxes. 



-119 





delay 5j - 






delay 82 




_ arknnwlmfoment 








request 










input 


■ 








source 


source 




pulse generator 


~) - 












k 










\ data nath 





















A- 



V BA 

V CA" 

^DA 



V AA 



*AB 



V AC 



V AD 



maj 



"R TT 

to other synchronizing from other synchronizing 
merge modules merge modules 



B, 



BA - 





(a) Decision Box. 


p p 


1 


n B AA 


"71 


a 


-» B AB 


a 


t - 


C CA~* 


t - 


c 
h 


-*- B AC 


c 
h 




B 

a* b! 



AA 
CA 



^A 



AA 



-AB 



-AC 



°DA^ 



'AD 



'AD 




maj 



•B' C 



'AA 
BA 




maj 



-DA 



D 
-C D 
D 



AA 
BA 



CA 



.D 



AA 



.D 



AB 



-D 



AC 



•D 



AD 




maj 



D' 



C D' 



A' B' C 

\LJL/ 



thsh(3,4) 



input source 

(b) Decision box data path for synchronizing merge module A. 
Fig. 5.8. An implementation of the decision algorithm. 



-120- 



(i v) The request source for each synchronizing merge module is determined using a majority voter. 

(v) Request sources determined in Step (iv) are fed to a threshold circuit tnsh (3, 4) (See Section 5.1) 
to derive a new input source. 

In each round of joint decision, latches in a decision box should only be activated after new 
request sources arc available from all fault-free synchronizing merge modules in die same redundant 
module. And then only after the outputs from all these latches have settled down, and a new input 
source derived from them using majority voters and the threshold circuit, should a decision box 
return an acknowledgment to its output cnti module. Our approach to synchronize the operations in 
a decision box with those in other decision boxes is to make use of two derived upper bounds 8j and 
5 2 , such that 

(1) If a decision box receives a new request source from its output cnti module at t, then it will receive 
a new request source from every fault-free synchronizing merge module in die sane redundant 
merge module at t + dj. 

(2) If new request sources are available for storing in latches in a decision box at t + fij, then anew 
input source will be available at die output of die threshold circuit thsh (3, 4) in that decision box 
att + o^+oj. 

Based on these two upper bounds, our synchronization strategy is to generate a pube in a 
decision box 8j seconds after it has received a new request source from its decision box, to store new 
request sources into latches, and then return an acknowledgment to output cnti h^ seconds later (Fig. 
5.8a). We next consider factors that determine 8j and t^. fij and Sj can be computed for an actual 
implementation based on these considerations. 
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In the decision box of synchronizing merge module A (Fig. 5.8b), for example, 8j must be 
sufficiently long such that request sources arc delivered at signal lines labeled "A", "Bg A ", "Cqj^" 
and "I^da" within 6j of each other. We can identify two components for Sy One is an upper 
bound on the variations in propagation delay along paths leading from interfaces between decision 
boxes and output cntl modules to latches in decision boxes. This bound can be derived by taking 
delay measurements in an implementation. The second component is an upper bound on the 
difference in arrival time among request sources to be submitted for the same decision, as they are 
delivered to decision boxes in the same redundant merge module. To calculate this component we 
must consider the various situations that may occur in other decision boxes as a decision box receives 
a new request source. 

If the new request source received by a decision box is one that has been rejected in a previous 
decision, a request from the same input port must be pending at the merge module of every 
synchronizing merge module which has not yet received this request If the new request source is 
derived from a packet which has just arrived at the synchronizing merge module containing the 
decision box, the same packet will arrive at other synchronizing merge modules within a fixed time 
interval, due to in-phasc operation. Thus under either of these two situations, a new request source 
will arrive at all other decision boxes within a fixed interval, unless conflict arises at a merge module 
due to the simultaneous arrival of two output requests. When a conflict arises, the merge module may 
be driven into its metastable state and may stay in that state for an arbitrarily long time. The 
probability that a conflict arises at time t and remains unresolved at time t + a decreases rapidly with 
a, for many implementation technologies. We can in practice pick a value aj and assume that any 
conflict that arises at time t is resolved by time t + aj. Under this assumption, then, we can also 
calculate an upper bound on the difference in arrival time among request sources for the same 
decision as they are delivered to decision boxes in the same redundant merge module. 
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To compute &2- tet "s consider Ac various situations the other decision boxes may be in when a 
decision box generates a pulse to store new request sources into its latches. Since not ati new request 
sources arrive at these decision boxes at the same time, and each decision box generates a latching 
pulse 5j seconds after k has received a new request source, latches in different decision boxes may 
not be activated together. After a decision box has generated its latching pulse, it must wait long 
enough until outputs from all latches in fault-free decision boxes have settled down. Thus if new 
request sources can arrive at decision boxes as far apart as/8 then $2 raB * accC( ' fi- Furthermore 0*2 
must be long enough to take into account the propagation delay along pans between decision boxes, 
and the time it takes for the majority voters and threshold encaits to generate a new input source 
after the latches have stabilized. There is yet another component which must be addedto dj, » take 
into account signals generated by failed synchronizing merge modules driving receiving latches into 
their mctastable states. Wc assume mat a latch will always have come out of its metastablc state « 2 
seconds after it has entered mis state, and use aj in computing Sj- 

As in die decoding section designs presented in Section 5.2, wc deal with stuck-at faults, 
random pulse trains and random wave trains aft at once by storing mtennodute messages in latches 
and assuming that if any of these latches is driven kito a mesattable state by a fauky signal, it wffl 
always come out of that state after a 2 seconds. 

There are two mctastable state phenomena we have faced in die synchronizing merge module 
design: two output requests arriving simultaneously at a merge module, and a random pulse train or a 
random wave train being delivered to a latch. The former occurs during fault-free operation while 
the latter is strictly a hardware failure symptom. In both cases we have assumed that if a hardware 
clement enters its mctastable state at t, then it would have come out of mat state by t + a, for some 
fixed a. By using larger and larger a\ this assumption can be made more and more accurate, at die 
expense of performance. This tradeoff between performance and reliability seems unavoidable in 
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non-determinate systems. 

We next analyze performance compatibility among copies of synchronizing merge modules 
implemented using the above techniques. There are three factors contributing to phase difference 
among packets or acknowledgments in the same batch generated by synchronizing merge modules in 
the same redundant merge module: 

(i) the phase difference among packets and acknowledgments in the same batch delivered to the 

synchronizing merge modules, 
(ii) the time it takes a merge module to resolve conflicts. This is necessary because conflicts may 

arise at some, but not necessarily aft", synchronizing merge modules. 
(iii) the two different courses of action that can be taken by output cm! after the input source is 

determined. 

We have assumed that acceptable reliability and fault coverage can be achieved by upper bounding 
(ii) with oj. Under this assumption the performance incompatibility between two synchronizing 
merge modules is bounded by some parameter which depends only on propagation and gate delays 
through various paths in and between these modules andin-phase operation can be maintained. 

5.4 Design Examples 

In this section our techniques for maintaining timing synchronization and consistency in 
redundant packet communication systems are illustrated by applying them to the design of 
fault-tolerant routers. 

A router is a 2 input/2 output building block module used in constructing packet networks for 
the packet communication computer architecture presented m the next chapter. This module 
receives packets at its two input ports and delivers each received packet at one of two output ports 
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according to a destination address carried by the packet, and is designed so that packets to be 
forwarded at different ports can be processed concurrently. Such concurrency is naturally supported 
by decomposing the router into two input modules (IM) and two output modules (OM) (Fig. 5.9a). 
An input module is a sequential machine which examines the destination address in each packet and 
forwards that packet to the output module specified. An output module is simply a synchronizing 
merge module. Straightforward application of the redundancy techniques presented in this and 
previous chapters leads to the redundant router design shown in Fig. 5.9b. Control modules 
belonging to the same synchronization set and OM modules wluch must be kept consistent are 
enclosed by dotted lines. This redundancy scheme can tolerate hardware faitares confined to a single 
unit consisting of two synchronuing decodeis, a router module aadtwo laneut modules. 

In an implementation of a non-redundant router using off-the-shelf components, it is more 
practical to implement packet communication using a sequence of 8-bit nibbles delivered under a 
data link/control link protocol (Fig. 1.1). Under this protocol redundancy can be applied to die data 
nibble and control link independently. FerfeuhmaAmglhedauiha^* can either be replicated or 
encoded. The control link can be pn*ccted by rephcatieaush^ tech*^^ 
chapters. A redundancy scheme based oa a singje-entfr-oonecimg Hammmg cede p9J code is shown 
in Fig. 5.10. This code uses a total of 12 binary signals to transmit 8 information bits. Each input 
nibble is decoded by four router units. The data outputs of these router units are voted upon to 
derive the 12 output bits at an output poit Control sjgm 

synchronization as they arrive at the router units. These unto aba ciehaagr messages to select input 
sources. For clarity the intcrunit synchronization paths are omitted in F%. 5.10. m using a separate 
control link to synchronize packet communication the delays incurred in different portions of die 



1. The term- byte is commonly used instead of wMrfrt in this context We 1 prefer the term nfflks 
since we have used the term byte to denote bit fields in a word generated by a redundant module. 
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(a) A 2 X 2 Router. 




(b) Fatrit-toterant Configuration 



Fig. 5.9. A fauh-toterant router design based on replicated redundancy. 
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(b) The Redundant Router 



Fig. S.10. Hardware structure of a 2 X 2 router using an error-correcting code. 
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data path must be known. Voter delays, for example, must be taken into account in asserting ready 
signals. In addition to hardware failures confined to a single router unit, single voter failures in this 
redundant router can also be tolerated. 

Compared with a replication scheme, a coding scheme requires less input and output 
connections among routers at the expense of more hardware packages and connections within a 
router. This is advantageous if a more expensive and/or less reliable technology is used for 
interroutcr connection. In the fault-tolerant architecture described in Chapter 6, a redundant router 
designed to detect hardware failures is used to construct packet networks. This router uses a parity 
check to protect the data link and has the same hardware structure as that shown in Fig. S.10. Its 
design is discussed in more detail in Section 6.3. 

5.5 Discussion 

In this chapter we have studied hardware implementation of the redundancy management 
algorithms presented in Chapter 3, and analyzed the fault tolerance capabilities of these 
implementations under the stuck-at fault model, the random pulse train fault model and the random 
wave train fault model. 

For the synchronizer implementation, different techniques are given for tolerating hardware 
failures under the three different fault models explained in Section 4.2. Since these three fault 
models form a strict hierarchy in terms of modeling power, any technique for tolerating hardware 
failures modeled by one of them is also adequate for tolerating hardware failures modeled by the less 
general ones. 

In the implementation of decoding sections, our approach is to first generate a synchronizer 
signal that is free of output hazards and runt pulses. Using this signal as a timing signal to store input 
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signal values into latches, we have shown how to reduce the fault tolerance problem of handling 
random pulse trains and random wave trains to a timing problem caused by metastable state 
phenomena in bistable devices. This same technique is also used is synchronising merge modules to 
deal with random pulse train and random wave train inputs. The timing problem can be dealt with, if 
only probabilistically, since in most implementation technologies the probability that a latch remains 
in its metastable state at time t + p after entering that state at t decreases with p. ft is thus possible to 
tradeoff reliability with performance in these situations. 

We have developed an approach to mask and detect hardware failures in a packet 
communication system based on maintaining timing sy«*io»iM6on and consistency throughout the 
enure system. A methodological issue that arises is wltethcfv these two-prop«rtieB are necessary for 
tolerating hardware failures in packet communication systems, it seems that consistency must always 
be maintained in a byte-sliced module, otherwise a packet batch generated by fault-free slices can 
have arbitrary contents, and byte slice failures can lead > to urdcaected and uncorrected errors. We 
next describe a technique for detecting hardware failures under trie random pulse train fault model 
without rnaintaining timing synrhroni7arion 

A popular scheme for fault detection in synchronous hardware systems is to operate two 
identical hardware units c»iKuniendy and deject feuur« *«r 

outputs. Note that there is not enough redundancy in a duplex system for maintaining tinting 
synchronization or consistency using our technique* Duplication can nooetijcless be exploited to 
detect failures in determinate systems, if hardware failures can be modeled as random pube teams 
and, as stated in Section 4.2, C-elements can be used to filter out random pulse trains. A decoder for 
duplex systems can be constructed using the technique illustrated in Fig. 5.6, except that a C-clement 
instead of a synchronizer is used tn activate input latches (1% 5.11). A fallout module consists of 
simply a C-dcmcnt and a byte detector for deriving die filtering signal from the packet bundle. 
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Fig. 5.11. A synchronizing decoder for duplex systems. 



This technique can be generalized to detect hardware failures under the random pulse train 
fault model using other encoding schemes. In such a scheme, a decoder will always wait for one 
packet from each neighboring fanout module before decoding, while a fanout module will always 
wait for one acknowledgment from each neighboring decoder before returning an acknowledgment 
to its p-module. 

A similar strategy for masking hardware failures is for a decoder to wait until it has received 
packets from sufficiently many neighboring fanout modules to derive a new error-free input word, 
and then forward this input word to its p-module. A fanout module is designed to forward a new 
byte it has received from its p-module to each decoder which has acknowledged the previous byte. 



In a redundant configuration based on byte slicing, it can be shown (See Davies and Wakerly 
[17] for an illustration of this phenomenon) that byte slice failures in redundant modules can cause 
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fault-free slices in their neighbors to go out of phase by an arbitrary amount Thus a byte slice that 
has fallen behind others in its module may not have foiled. To maintain communication under die 
adopted handshake protocol, a decoder should not return acknowledgments to a neighboring slice 
that has fallen behind before receiving new input bytes from this slice. The decoder should 
furthermore retain enough information, such as a byte count, to be able to deduce when mis 
neighboring slice has caught up with other slices in its module. Since a neighboring slice can fall 
arbitrarily far behind, there is no upper bound on the amount of information, and hence the storage 
for it, that must be kept in a decoder. This decoding strategy is thus theoretically not realizable, 
although a practical decoder implementation can be constructed using sufficiently large counters. 

There is a similar theoretical problem in implementing this strategy in a fanout module. A 
packet must be retained in a fanout module until it has been delivered to every neighboring decoder, 
since the contents of an output packet generated by a p-moduk may depend on die entire stream of 
input packets it has previously received and processed. A neighboring decoder can again fall 
arbitrarily far behind and there is no upper bound on the amount of buffering required to store 
undelivered packets in a fanout module. The question of how much buffering to include in each 
fanout module, and buffer management techniques, present more of a practical problem. 
Techniques for estimating die buffering requirements accurately must also be developed since the 
effectiveness of this strategy in combating hardware fafl ur e * depe nd * on the accuracy of such an 
estimation. 

By maintaining timing synchronization in a redundant packet communication system, all 
fault-free byte slices in a redundant module are always in-phase, and the buffering problems that 
arise in die above scheme is avoided. Instead of buffering facilities, additional hardware and 
interconnection paths arc invested to implement die synch roniza t io n algorithm for maintaining 
timing synchronization. 
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6. Design of a Fault-Tolerant Packet Cowmuiicatioo Computer Architecture 

In this chapter we describe a conceptual design for a fault-tolerant data flow processor intended 
for physics simulation applications. The salient features of (bis design have been introduced in 
Chapter 1: 

- H|gh performance and fault tolerance are achieved by using pools of identical hardware units. 

- Communication between processing elements and functional units is supported by packet 

networks. 

- Hardware in fte data flow processes is ^ 

As observed in Chapter 1, mere are no stringent reliability requirements inherent in physies 
simulation applications. We are interested in adding fault tolerance capabilities to the non-redundant 
system primarily to improve maintainability and avaaability. System throughput is improved if 
hardware failures can be masked, especially since k is not unusual for a numerical computation in 
physics simulation to execute for many hours. 

Fault-tolerant mechanisms employed in the redundant data flow processor are organized 
according to a dynamic mbmekmy scheme. In this scheme hardware failures are delected and 
diagnosed, the data flow processor is repaired, and then afflicted subcomputations are reexecuted on 
the repaired, possibly degraded, system. We assume act programs for the data flow processor are 
prepared on a host machine, loaded into the data flow processor, and then e x e c u ted. This host 
machine is also assigned me tasks of configuration control, and of coordinating diagnosis, repair and 
recovery activities with program execution. We win explain the strategy to be implemented on this 
host machine, and the application of hardware redundancy techniques developed in previous 
chapters and some packet encoding techniques to implementing hardware modules in support of this 
strategy. A fault-tolerant implementation of this strategy on the host machine can be constructed 



132 



using conventional techniques, and wiO not be considered. 

The hardware organization of the data flow processor and its operation are explained in Section 
6.1, where we wilt focus on those aspects of its operation relevant to fault tolerance considerations. 
The dynamic redundancy scheme is explained in Section 62. Hardware module designs to satisfy 
fault tolerance requirements imposed on them by the dynamic redundancy scheme are presented in 
Section 6.3. Strategies for incorporating additional hardware into a packet network to support rapid 
repair are discussed in Section 6.4. 

6.1 A Packet Co MMun ication Computer Ardrittct«re 

6.1.1 



Hardware in a packet communication computer architecture is organized as an interconnection 
of self-timed modules which communicate by sending packets to each other. Each packet is 
transmitted as a sequence of nibbles between modules. Packet nibbles aje delivered and received by 
hardware modules using the ready/acknowledge protocol depicted m Fig. 1.1. Each module port 
consists of a bundle of data wires and a pair of ccotroi wro 03g. Lla). Packet communication is 
synchronized by sending control signals over the control wires. Avaflabihty of a new nibble at a 
connection is signaled by sending a rectify signal over me ready wins, as receipt by returning an 
acknowledge signal over the acknowledge wire. Jtaaay and acknow le d ge signals are re p re se nt ed by 
signal transitions (Fig. Lib) on die re s pective < 



The major modules in the data now processor (Fig. 6.1) are processing elements (PEs), 
specialized functional units (SFUs), a routing network and an allocation network. Scalar operations 
are processed in the PEs and SFUs. The networks support packet traffic among the PEs and SFUs. 
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Fig. 6.1. Hardware architecture of the data flow processor. 



Operating Principles 



A machine level program for die data flow processor is stored in the PEs as a set of activity 
templates [20]. An activity template A contains an apemtim eoaewAlteaikirmes of oveot mate 
activity templates, called A's target templates, whk* should receive me result generated by processing 
A. An activity template is uaiquely identified by it* address, which has two components: a 
destination tag which specifies the Hi in which die template resides, and die location of die template 
within that Kl 
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Operations are divided into two classes, according to whether they are executed by a re or by a 
SFU. When activity template A is enabled (See below for a discussion of the enabling conditions) 
and the operation it specifies is executable on aPE, the operation is applied to the operands A has 
received and die result of the application is dispatched to A** taigeMemplate*. If a target template B 
resides on another PE, a copy of the result is tagged with B*s address and a nibble count to form a 
remit packet, and delivered to B via the routing network. For an operation executed on a SFU, an 
operation packet cons&ing of die operation code, operands and die target template addresses is 
formed and delivered to a SFU via the routing network and the allocation network. At a SFU the 
operation specified in an operation packet is applied to die operands and result packets are generated 
and dispatched to the target templates via the routing network. 

An activity template A is enabled when two conditions are met First of all die operands 
required for the operation must have arrived. The second condition is a consequence of organizing 
machine level programs in die data flow processor to support pipelining and iteration. Suppose that 
in such a program activity template A sends resuitt to activity temp i aftt B residing in another PE To 
avoid deadlock (38), an operand sent from A to B must be processed before A can send B a second 
operand. The execution of A and B are synchronized by conditioning each activation of A on 
receiving an acknowledge packet from B. This acknowledge packet is transmitted when B s 
processed with the previous operand received from A. Thus before an activity template can be 
executed, it must have received the necessary actnowtedf^wnlsnidn fo target tempfaues. A detafled 
explanation of this synchronization scheme is given in f26J. Under thb scheme every result packet 
transmitted through the networks is acknowledged by an acknowledge packet returned by die target 
template. We wiff also make the assumption diat «ue* operation packet contains exaciy one target 
template address. This assumption results in no loss of generality since the result obtained by 
processing an operation packet can be further distributed through its target template. In fault-free 
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operation under the rtove synchronization scheme and this assumption on operation packets, a PE 
will receive exactly one acknowledge packet fer every result packet or operation packet it delivers to 
the routing network. This property will be used to coordinate recovery activities with normal 
program execution in the dynamic redundancy scheme. 

A PE provides storage for activity templates, executes simple operations, sends packets to and 
receives packets from die routing network. A SFU is designed to execute complex operations, such 
as floating point arithmetic, efficiently, with additional capabilities to receive operation packets from 
the allocation network and send result packets to the routing network. Implementation techniques 
for these hardware modules as discussed after their fault tolerance requirements have been 
determined. In tile remainder of mis section we take a closer look at the structure of the routing 
network and the allocation network. 

6.1.2 Packet Networks 

An N X N routing network, with N input ports and N output ports, supports packet 
communication among N PEs. h accepts packets at its input ports and transmits each packet at die 
output port specified by a bit field comprising a header or destination tag of the packet The 
destination tag and a nibble count spearying the length of the packet are contained in die first few 
nibbles of a packet Routing networks can be constructed using 2X2 routers. A 2 X 2 router 
receives packets at its two input ports and delivers etch received packet at one of two output ports 
according to the destination tag carried by die packet The 2 X 2 router is designed to that packets to 
be forwarded at different output ports can be processed concurrently. We have described a hardware 
organization for the router module (Fig. 5.10a) and several designs for fault-tolerant routers (Fig. 5.10 
and Fig. 5.11) in Section 5.4 to illustrate our hardware redundancy techniques. 
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Methods and techniques for tolerating hardware failures in routing networks will be illustrated 
using rectangular routing networks. An N X N /ar*aftf»&r network k bwk from 2X2 routers by 
the recursive construction illustrated in Fig. W. An N X N network soconstructed has log 2 N stages 
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Fig. 6 J. Rectangular routing networks. 
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each of which contains N/2 routers. AH packets sent to an output port of the routing network, 
regardless of their sources, have identical destination tags. Routers in succeeding stages in the 
network examine successive bits in a destination tag to forward the packet along the proper path. 
Path control in a routing network is distributed among the routers. There is no centralized control 
mechanism whose complexity must grow with network size and which may become a performance 
bottleneck. Many packets can be forwarded concurrently to provide a high throughput rate. One of 
the design objectives for a fault-tolerant routing network is to retain all these characteristics during 
fault-free operation: decentralized path control, parallel processing, and asynchronous nibble-serial 
communication. 

In the data flow processor the allocation network receives operation packets from the routing 
network and distributes them among the SFUs. Each of its input ports has its own routing network 
address. In the data flow processor we use a rectangular allocation network which also has the 
topology shown in Fig. 6.2, constructed out of 2 X 2 allocators. An allocator receives operation 
packets from its two input ports and forwards them at its output ports as these output ports become 
free. It is possible for operation packets to be temporarily "trapped" in a section of the Allocation 
Network waiting for service even though SFUs not reachable from this subnetwork are free. Such 
trapping has the pleasing property of automatically diverting other operation packets from the 
congested subnetwork. We next present an allocator implementation constructed from determinate 
modules and merge modules. This implementation can be rendered fault-tolerant using exactly the 
same techniques illustrated in Section 5.4 for router modules. 

The implementation strategy we have adopted for the allocator is to generate service requests, 
for new input packets, and availability tokens, when output ports become free. Service requests from 
the two input ports are then merged together, as are availability tokens from the two output ports. 
Each service request is matched with an availability token, and a packet is transferred from the 



138 



requesting input port to die available output port In the hardware implementation (Fig. 6.3), service 
requests are generated by input port controllers, availability tokens by output port controllers, and they 
are matched at a matcher module which also handles packet transfer from input port controllers to 
output port controllers. 

6.2 A Fault Tolerance Strategy Based on Dynamic Redundancy 

It is possible to apply static redundancy techniques uniformly in the data flow processor to 
implement a fault masking capability in hardware. We have presented two router designs in Section 
5.4 appropriate for constructing fault masking routing networks, and pointed out mat similar 
techniques are applicable to allocation networks. We have also designed a router module, based on 
parity checks, to support fault detection and, again, similar techniques are applicable to the design of 
fault-detecting allocator modules. This router design is described in the next section. A packet 
network constructed with these fault-detecting modules has considerably fewer intermodule 
connections than one based on replication (using the type of router modules illustrated in Fig. 510), 
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Fig. 63. A hardware implementation of an allocator module. 
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and fewer hardware packages and intermodule connections than one based on error-correcting codes 
(using die type of router modules illustrated in Fig. 5.11). It is thus attractive to develop a fault 
masking strategy which can exploit die advantages of network designs based on fault-detecting 
modules. Such a strategy leads to a dynamic redundancy scheme. 

In a dynamic redundancy scheme, failures are masked through hardware-implemented fault 
detection, diagnosis, repair, followed by reexecution of the afflicted subcomputations. To support 
the dynamic redundancy scheme, each PE must store additional information that is not used in a 
non-redundant data flow processor. A copy of every result and operation packet delivered to die 
packet transport and processing CjPTP) subsystem consisting of the packet networks and the SFUs 
must be kept until die packet is acknowledged. At each PE die sender of every received packet must 
be known. This information can be maintained by associating each operand position in an activity 
template with a template address for returning acknowledgments. The sender of each 
acknowledgment packet and retransmission request(See below) must also be identifiable. 

The dynamic redundancy scheme assumes that the hardware implementation of the data flow 
processor has die following properties, even in the presence of hardware failures. 

(1) Each packetreceived by the rotting network is deliver^ 

destination tag, either to a PEer die afloatioa network. Each packet received by the 
allocation network is delivered to a SFU. Specifically, neither packets nor packet nibbles 
will be lost in die networks. 

(2) F^h packet ctelivered to a iietwork output poAi^ 

(3) Target template addresses carried in an operation packet are always delivered free of error. 

(4) For every operation packet received, a SFU will deliver either an error-free result packet or 
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one tagged as erroneous to the target template specified. 

(S) Every acknowledge packet and retransmission request (See below) is delivered free of error 
to its destination. 

Techniques to design redundant routers, allocators, PEs and SFUs that can be used to implement a 
data flow processor with these properties, even when up to one hardware package in each module has 
railed, will be presented in Section 6.3. The fault tolerance strategy is as follows. 

Any packet which has encountered a faulty router or allocator in its journey through the 
networks will be marked as such upon delivery to fte destination rl^orSFU."When a SFU receives 
an operation packet tagged as erroneous, it will generate a result packet, tag ft as erroneous, and 
forward it to the target template specified in the operation packet if a PE receives a result packet 
tagged as erroneous, it signals the host machine, which in turn signals the other PEs to stop sending 
packets to die routing network. AH packets in transit wffl alfifve at their destinations after a finite 
time period, which can be determined from hardware parameters specified for the PTP subsystem. 
After this time period the PTP subsystem can be repaired under the direction of error signals 
generated by fault detectors in this subsystem. After repair the PEs are restarted. A PE which has 
received a contaminated packet wifl issue a rttraMmmm nqpat instead of an acknowledgment to 
the sender. Program execution can other wi se p rocee d nttr—BTjr. 

Activity templates representing a machine level program, and a complete intermediate state of 
the computation in progress, are stored in the PEs. If only routers, allocators or SFUs have failed, the 
computation can always be restarted from fte mtefmedka* state stored in me PEs and ran to 
completion after the networks have been repaired. If failures occur in the storage components of a 
PE, it may be necessary to abort the computation in progress, since die intermediate state may 
become inconsistent. Storage failures tins must be masked to achieve complete fault masking. 
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Failure in other components of a PE need not be masked in hardware, but the activity templates and 
partial intermediate state stored in it must be relocated before processing can resume. If an activity 
template A is relocated, the entire activity template set must be relinked so that other templates 
having A as a target will contain the new address of A. It thus seems desirable to mask all failures in 
PEs locally in hardware. Communication between PEs and me host machine to coordinate repair can 
be implemented with an interprocessor bus. Fault-tolerant busing structures have been presented in 
[30]and(67]. 

The PTP subsystem must be repaired after failures are detected. A foiled module can be 
replaced or repaired in place manually. Availability is improved by reconfiguring around the failed 
module automatically and then repairing the failed module off-Mae. For a computation in progress 
to proceed successfully, the full functionality of the renting network must be retained, Le., packet 
communication between any input port and any output port must be maintained. Two strategies for 
incorporating spare routers and data paths into a routing network are discussed in Section 6.4. Under 
these strategies die full capability of a routing network is retained so long as spares are not exhausted. 
These strategies are also applicable to allocation networks. A SFU failure can be repaired by simply 
taking it off-line. An allocator will not forward operation packets to any SFU whiett-has stopped 
acknowledging inputs. The machine architecture is gracefully degradable with respect to SFU failures 
in this sense. SFU failures have no effect omer man degraded performance across me FTP 
subsystem boundary. 

The dynamic redundancy scheme we have described is built directly on the execution control 
mechanism m a data flow processor. It has the merit that extensions to die execution control 
mechanism are incorporated in low level hardware functions and require no extra programming 
effort to achieve fault tolerance. 
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63 Modi* DesJfi 

Our basic unit for fault tolerance considerations is a package, which receives input signals and 
delivers output signals through its terminals. Hardware modules are constructed using packages. We 
also assume that the mean time to repair is much shorter than the mean time to failure. Modules 
described in this section are hence designed to tolerate up to one pacluige faihire per module. Indus 
section we present hardware redundancy and packet encoding techniques to support die five fault 
tolerance properties stated in Section 62. We fast describe the encoding techniques: 

- Control signals are generated in quadruplicate, from four failure-independent packages, 



Each data nfcbk is protected by a parky bit The nine bits of a parity-encoded n*ble are 
generated from nine faBuie-mdependcat packagM. 

Each packet nibble whose error-free transmission must be guaranteed is expanded into three 
nibbles. The second and third rubbles in each triplet are obtained by rotating bits in the given 
nibble one and two positions to the right, respectively: 

given nibble: ^rfh ^2^3^^S^6^^8* 

^nibble: Wl b 2 b 3Wtf* 

^tdbbte: WtoW*^ 

This encoding scheme can be regarded as an anplementation of triple modular redundancy in 
time instead of m space. Mbbies in rabtie am* fields and amjiftiftj aidrta fields, and 
identification tags in acknowledgment packets and rati H i t mi ss io n requests are protected using 
ttiis technique. 

An aD C/s data nibble is appended to the tail of each packet This nibble is used to flag packets 
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which contain erroneous nibbles. It is set to all l's by the first module which detects the parity 
violation, and is otherwise retransmitted as received The packet is accepted as error-free only if 
its flag consists of all O's. 

Using the hardware redundancy techniques developed in previous chapters and the above 
encoding techniques routers, allocators and SFUs can be implemented with the following fault 
tolerance properties: 

~ If at most one of the four control signals delivered on each quadruple is faulty, its pathological 
effects can be masked. 

- If at most one data wire in each parity-checked data link carries faulty signals, the nibble counts 
and addressing information in each packet can be retrieved. The test nibble in the 
corresponding output packet w*8 be flagged, ie* wiH not be all (f% if any packet nibble has 
violated the parity check. 

- If at most one package in the module has fatted, the above capabilities are not impaired and 
faulty signals are delivered on at most one output control wire and one output data wire at any 
module port 

We note that it is in fact possible to mask all single package failures in routers, allocators and 
SFUs with these fault tolerance capabilities if every packet nibble is triplicated using the 
rotate-and-repeat encoding scheme given above. This approach to fault masking leads to lower 
performance during fault-free operation, as compared with the dynamic redundancy scheme 
explained in the last section. 
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We next illustrate module design techniques with a router module design and a processing 
element design. These techniques art afeo applicable to impl e mentin g the desired fault tolerance 
capabilities in allocators and specialized functional units. 

Router Module Design 

A redundant router module (Fig. 6.4), just as its nonredundant counterpart, receives packets at 
its two input ports, and delivers each received packet at die output port specified by a destination tag 
carried in the packet Packet receipt and forwarding are synchronized by control ^gqab delivered in 
quadruplicate. Packet nibbles are parity encoded. The redundant router is implemented using four 
cAerfttrpackages and nine w/«rpaekages(Fig.6.4). B^chedrer(l%. 6.4a) has two input ports and 
two output pom. Control signals generated at the c o rr es p o n ding ports of the four checkers are 
grouped together at each port of me redundant routermodule. Thus, for example, me acknowledge 
signals generated by die four checkers at their Ij input ports (Fig. 6.4b) are grouped together at input 
port Ij of the redundant router. Data nibble outputs from the corresponding output ports of the four 
checkers are collected together and voted upon at the voters to derive outputs for an output port of 
die redundant router. In Fig. 6.4, die number of wires represented by each arrow is given in its label 
in parentheses. This number is omitted if ok arrow represeittS a singk wire. 

A checker receives aO input control and data signals delivered to the router, and implements 
several fault tolerance capabilities in addition to packet routing: 

- mask control signal failures when at most one contrd signal hi each quadruple is faulty. 

-- deduce the error-free nibble for every data nfobte encoded in triplicate using die 
rotate-and-repeat scheme given above. 
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-- set the last packet nibbk to all Ts if the last packet mbble it has received is not all O's or if parity 
violation has been delected. 

-- maintain timing synchronization and consistency using the techniques developed in previous 
chapters. For clarity, coaununkations paths for exchanguig syncntonizaoon signals and 
messages among checker packages to implement these redundancy management functions are 
omitted in Fig. 6.4. 

In me redundant router shown in Fig 6.4. data faults due to single checker failures win be masked at 
die voter packages. Control faults due to single checker failures and data faults due to single voter 
failures are detected and dealt with in hardware modules receiving packets from this redundant 
router module. 

Processing Ekment Design 

A PE provides storage for activity templates and fear operands, as well as functional 
capabilities for activity template processing, and input/output capabflitie* for packet commumcation 
and error report A detailed tegfc design for a non-redundant ffittfmg commercially avanabte 
components is given in |6& [24J. To support the dynamic redundancy scheme, a EE in a 
fault-tolerant data flow proc ess o r can be constructed using a lauk masting bft-sfieed memory £31 a 
redundant control unit, and a redundant functional unit (Fig. £3). 

A partial state of me computation in execution is stored in die bit-sliced memory. For the 
computation to be recoverable after single package faOures^aO such failures must be masked along 
the data path used to retrieve this information from die memory to the host machine. This is 
achieved by using a bit-sliced memory protected by aaeror-oarrecfinfvcode tostore this information, 
and a control unit with the same structure and operating principle, and hence die same fault tolerance 
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Fig. 6.S. Design of a fault-tolerant processing element 



capabilities, as the redundant router module shown in Fig. 6.4. The control unit consists of four 
failure-independent packages each of which receives all input signals delivered to the control unit 
The outputs of these four packages are grouped together or voted upon to form outputs of the 
redundant control unit 



Addresses and data transmitted between the redundant control unit and the bit-sliced memory 
system, and those transmitted between the control unit and the host machine, are encoded using an 
error-correcting code. Each bit-slice in the memory system stores one bit of a data word and has its 
own address decoder. Any hardware failure confined to within one bit-slice thus affects at most one 
bit of a data word and consequent errors can be corrected. Packet nibbles transmitted between the 
redundant control unit and (he routing network are parity-encoded. Since the redundant control unit 
has the same fault-tolerance capabilities as the redundant router, single package failures in the control 
unit cannot cause undetected erroneous packets to be delivered to another PE. 
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The functional unit is the only subunit in a PE that need not be completely fault-tolerant. 
Package failures in it must nonetheless be detectable. Many redundancy techniques are available for 
detecting failures in functional units. For a commercially available LSI functional unit chip, it is cost 
effective to detect single chip failures through duplication and naask these feitures through triple 
modular redundancy, as desired. Communication between the redundant control unit and the 
redundant functional unit can also be protected using either an error-correcting code or an 
error-detecting code, depending on whether failures in the functional unit are to be masked or 
detected. 



M Network Repair Strategies 

When a network hardware Mure is detected in the dynamic redunrtaiwy scheme, all PEs wiO 
stop sending packets to each other and after a predctermmed tone period fbe networks will be 
dormanL Before normal processing can resume the failed unit must be located and the networks 
must be repaired. In this section we will illustrate two repair strategies using routing networks. 

The first step in any repair procedure is to locate the failed router. Each checker package can 
generate an error signal upon detecting a parity violation. Failures in the last network stage are 
detected directly by parity checkers in die PEs and SFUs. Since checkers are quadruplicated in each 
router, PE and SFU, two or more error signals will be generated tat each )eg$nute coniplaint under 
the single package failure assumption. These error signals can be used to locate the faued router. 
Further diagnosis will be necessary to locate the failed packagefs). 

The most straightforward repair procedure is to make use of error signals generated by checker 
packages to locate the failed unit and then replace it mantM/Ty with a spare. This procedure requires 
no additional hardware, but system availability is directly related to the availability of maintenance 
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personnel. The personnel requirement can be reduced by incorporating self-repair features into a 
network. The error signals will be monitored by the host machine which will direct repair activities. 
Additional modules or data paths must be incorporated directly into the routing network to support 
self-repair. 

In the self-repair scheme illustrated in Fig. 6.6, a number of spare modules are appended to 
each routing network stage, switched in electrically to replace failed modules. Switching 
arrangements are incorporated systematically using switch packages, which have been introduced in 
[36J, to support system reconfiguration. A switch can be set in one of two modes, either "crossing" or 
"bending" (Fig. 6.6a) the pair of input leads to the pair of output leads. Spare routers are 
interspersed with active routers. The reconfiguration capability of this switching arrangement is 
illustrated in Fig. 6.6b where the i*" router is assumed faulty. Note that in this scheme a spare router 
cannot replace any faulty router below it in the column. Control signals for setting the switches can 
also be carried in the interswitch connections. This repair scheme requires many additional data 
paths and packages, and must be further enhanced to tolerate switch failures. It is thus practical only 
when the additional hardware costs are acceptable and the switches are much more reliable than the 
router modules. One technique to tolerate switch failures is to connect each switch to more than one 
neighboring switch so that an immediate neighbor which has failed can be bypassed. These switches 
are called tipplers in [58]. 

The additional data paths introduced can also be used for off-line diagnosis, testing out the 
routers systematically with pregenerated test patterns. In the configuration shown in Fig. 6.6, the i 
router can be tested by the host machine while the remaining routers carry the packet traffic. The 
fault detection mechanism in the dynamic redundancy scheme assumed single package failures in 
each router. Multiple package failures or lurking failures which have not yet manifested themselves 
are not detected. Network reliability can be further improved by testing the routers for these failures 
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periodkally or after detecting a fault in software. 

In the above strategy the topological and operation characteristics of a rectangular routing 
network are retained after reconfiguration. In a rectangular network any router, except for those in 
die last stage, can be paired together with a neighbor in the same stage such that the two can be used 
interchangeably in packet routing. If one router in a par fails, its duty can be taken over by its 
partner and the network can continue to operate, possibly with degraded performance. This scheme 
can be implemented by adding two input ports and two output ports to each redundant router. If 
there is a path in the nonredundant network from router A to router B, a new path between A and B's 
partner is added. The redundant paths incorporated into an 8 X 8 network (Fig. 6 lb) are shown in 
Fig. 6.7. The last stage can be repaired by using the previous scheme. A packet can be forwarded to 
its destination along two different paths at each enhanced router. Both of these paths can be used 
during normal operation when all routers are fault-free^ or one of Uum may be designated a spare to 
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Fig. 6.7. A reconfigurable 8X8 network with redundant paths. 
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be used only when the other path is blocked by a felted rotifer. Information on the location of failed 
routers can be distributed by the host machine during repair, to disaWe connections to failed routers. 

The host machine can keep a count of me number of firihircs reported for each router and take 
it off-line only when a pre determined maximum failure rate is exceeded. Spare modules can then be 
better utilized when transient failures dominate. We also note mat neither of these repair schemes 
require recomputing destination addresses in a partial executed data flow program to complete its 
execution after being interrupted by a network failure. 

6JS Discussion 

The STAR computer [8] and me FTSC computer [521 are two examples of fault-tolerant 
computing systems based on dynamic redundancy. Both of them have a bus-oriented architecture 
designed for executing sequential programs, in mis chapter we have presented a dynamic 
redundancy scheme for masking hardware failures in a multiprocessor system designed to execute 
parallel programs organized by data flow concepts. These programming concepts and the quest for 
high performance also distinguish our work from other fault-tolerant mumproceswr projects such as 
the FTMP system [30} and the SIFT system J67J. 

The dynamic redundancy scheme is developed by Brat specifying a set of fault tolerance 
capabilities for die packet transport and processing subsystem, cotiSHftng of the packet networks and 
the specialized functional units, in thfrdata flow proaessor. Sirategief for program execution, and for 
coordinating fault-related activities such as fault detection, diagnosis and repair with normal 
execution are then formulated based on these fault tolerance c a pa b ili ties We have also explained 
hardware redundancy and packet encoding techniques for implementing hardware modules and 
subsystems to support these strategies. Redundant hardware is incorporated and operated according 
to the redundancy management methodology developed in previous chapters. 
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The packet transport and processing subsystem is designed to be constructed out of hardware 
modules, instead of with busing structures « communicatioos technology. The fault tolerance 
capabilities specified lor this subsystem haw been chosen primarily because it is quite 
straightforward to both develop a system strategy based on these capabilities, and to implement them 
using familiar hardware redundancy concepts at the module level. The fault tolerance and 
maintainability features of me resulting system design can be precisely characterized and, if only 
informally, verified. A different set of fault tolerance capabilities for die packet transport and 
processing subsystem can be adopted for system design, leading to alternative system and 
implementation strategies. When detailed logic designs and hardware failure rates are available for a 
hardware implementation, alternative schemes should be carefully evaluated with reliability models 
(41] to determine their cost-effectiveness. 

We have assumed that hardware packages fail under normal use, and that failures are readily 
repaired. The redundancy schemes have thus been presented assuming at most one package failure 
in each hardware module. As long as there is at most one raited package in any module, die 
computation in progress can always be completed. If the processing elements are not designed to 
mask single package failures in their functional units, it may be necessary to relocate die activity 
templates and the partial state of die computation stored in a railed processing element, and relink 
the activity templates, before program execution can be resumed. The redundancy techniques can be 
extended to accommodate multiple package failures by using more packages in each router and more 
elaborate coding techniques. In a physical realization several packages can share a physical unit as 
long as the physical system is partitioned so that under the most common failure modes at most (me 
package in each module can fail. 

We have demonstrated how to methodically deal with hardware failures in a practical 
implementation of a highly parallel data flow processor, with no impact on its programmability. We 
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have explained how hardware failures can be masked when the architecture is programmed in a 
restricted data flew language. Another operational restriction is that every packet transmitted over 
the packet transport and processing subsystem b acknowledged by another packet The fault 
tolerance techniques explained in this chapter are directly applicable whenever die hardware 
architecture is programmed under these restrictions. It is expected that more -sophisticated system 
strategies must be developed to incorporate fault tolerance cost-effectively into more advanced data 
flow architectures M, (66). 
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7. Conclusion 

In this chapter we present a summary of results and suggestions for further research. 
7.1 Summary of Results 

We set out to answer me question: 

"How can hardware failures be tolerated in a self-timed hardware system organized by a 
packet communication architecture and designed to execute data flow programs V 

Our answer is provided in two parts. In the first part we study the general question of designing 
redundant packet systems for fault tolerance, and examine the issues of redundancy organization, 
redundancy management and fault modeling. In the second part we study architectural issues in the 
design of fault-tolerant data flow processors. Our results m these two areas are presented next, 
followed by an evaluation of whether these results have provided a satisfactory answer to the above 
question. 

7.1.1 Fault Tolerance in Serf-Timed Hardware Systems 

Research reported in this thesis has been conducted as part of an effort to develop a design 
methodology for constructing computer systems with suitable performance, programmability, fault 
tolerance and modularity characteristics. In particular, we have studied the problem of achieving 
fault tolerance in self-timed systems organized by a packet communication architecture. In the past 
there are significant speed and economic penalties for constructing self-timed systems using 
off-the-shelf components. These disadvantages are greatly alleviated in custom LSI chip design, an 
implementation approach appropriate for constructing large parallel processing systems for high 
speed numerical computation. 
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Many stuck-at faults cause hangups in self-tuned hardware systems whose modules interact via 
asynchronous handshake protocols. This property is often cited as evidence that self-timed systems 
provide natural support for fault isolation and fault diagnosis. There exists no methodology, 
however, for incorporating redundant hardware into self-timed systems for fault tolerance, prior to 
our work. 

Our major result in mis area is a complete methodology for incorporating redundant hardware 
into a class of packet systems for fault tolerance. This class includes all determinate systems and 
non-determinate systems constructed with merge modules. In the course of this investigation, we 
have addressed the following Issues: 

Structure of Redundant Modules 

We have presented a byte-sliced hardware organization for redundant packet communication 
modules. Byte slices in a redundant module can be designed and constructed from functional 
specifications for the nonrcdundant module and die chosen encoding scheme. We have afao 
described a class of asynchronous packet communication protocols £% use ia these mortMks 

Fault Modeling 

Hardware failures in redundant modules are characterized by a stuck-at fault model, a random 
pulse train fault model and a random wave train fault model In each fault model, we have specified 
the kind of signals that can be generated by failed modules as well as me interaction between such 
signals and fault-free hardware elements in fault handlers. These fault models provide a vigorous 
basis for studying fault-tolerant hardware implementations. 
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Redundancy Management 

We have developed an approach to coordinate byte shces in each redundant module so that 
failures among these slices can be detected and/or masked by decoding the outputs they generate. 
This approach is based on maintaining timing synchronization and consistency among byte slices in 
the same module. This approach to redundancy management is supported by two robust algorithms 
under which timing synchronization and consistency can be maintained even after certain hardware 
failures have occurred. The effectiveness of this approach in dealing with hardware failures 
characterized by the stuck-at fault model, the random pulse train fault model and the random wave 
train fault model is also investigated. We have also briefly discussed an alternative approach to 
redundancy management for fault tolerance, based on buffering schemes, and its relative merits and 
disadvantages. 

We have presented control module designs which can delect and/or mask hardware failures 
under the stuck-at fault model without ever generating runt pulses. Under the random pulse train 
fault model, these control modules can generate runt pulses due to metastaWe state phenomena, but 
the probability of such occurrences can be reduced to acceptable levels. We have also outlined an 
approach to deal with random wave train faults, but have not analyzed this approach is sufficient 
detail to evaluate its effectiveness. 

Our timing synchronizatkra technique is closely related to the synchronization voting technique 
studied by Davies and Wakerly [17j, and has applications in fauk-toleraBt dock design. We have 
clarified the problems in this ansa and their solution teehMqttcsihrough investigating tite hardware 
implenientaixmofourtimmgsynchromzat^ 
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7.1.2 Fault-Tolerant Data Flow Processor Design 

A data flow processor can be rendered fault-tolerant by implementing a fault masking 
capability in each of its hardware modules. As an alternative approach we have developed a fault 
tolerance strategy for masking hardware faoores m die data flow processor based on dynamic 
redundancy. 1ms alternative approach offices the potential of considerable hardware savings, 
especially in the packet networks. 

The data flow processor is a parallel processing system in which several machine instructions 
may be executed in parallel, and intermodule communication is supported by packet networks. The 
STAR computer [8] is probably the first fault-tolerant computer organized by a dynamic redundancy 
scheme. In mis computer fault detection is supported by using arithmetic codes in arithmetic units 
and system buses, and duplication in units that perform logic operations. Spare hardware modules 
and spare bus lines ace incorporated and switched in to replace failed modules after hardware failures 
are detected and diagnosed. The STAR compuaer executes macfiOK inarructions serialry, and state 
information is saved at program checkpoints for subsequent rollbacks to recover from detected 
failures after repair. Our dynamic redundancy scheme can be considered a variant of .the strategy 
implemented on the STAR computer, refined in accordance wfcfi the anftkectural characteristics of 
die data flow computer 

- Failures in processing elements, specialized functional units, routers and alkxaton are detected 
by designing these modules, and encoding packet nibbles, to sapportp8i% checks. 

- Spare routers and allocators are embedded in packet networks to support rapid repair. 

Homogeneous sets of processing elements and specialized functional units are provided so tint 
the system can degrade gracefully when these modules ML 

- Transmitted packets arc saved until they are acknowledged, and retransmitted upon request 
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The dynamic redundancy scheme is developed by first focusing on fault tolerance requirements 
for the packet networks. A system strategy for fault tolerance is then developed based on the selected 
requirements. These requirements are also used to derive specifications for hardware modules. The 
desired fault tolerance capabilities are implemented in these hardware modules using hardware 
redundancy techniques developed in the first part of this thesis. This same methodology can be used 
to design other dynamic redundancy schemes for fault-tolerant data flow processors, using different 
fault tolerance requirements for the networks. 

Our design is but one of many possible alternatives for constructing fault-tolerant data flow 
processors. It nonetheless demonstrates that fault tolerance can be incorporated into a high 
performance computing system with no impact on its pregrammabilky. 

7.1.3 Evaluation 

The data flow processor is designed to achieve high performance through parallel processing. 
This parallel processing capability is not compromised m the fault-tolerant processor design we have 
proposed. Individual hardware operations may take longer to perform due to input decoding and 
synchronization. But pipelined and concurrent operation of hardware modules are still supported. 
In other words, fault tolerance mechanisms incorporated according to our strategy lengthens the 
execution time of microscopic operations, but has no effect on concurrency exploitation at higher 
levels. 

For our intended applications, we also favor a self-timed implementation approach guided by 
packet communication principles. In constructing a totally self-timed system, the timing 
characteristics of individual modules need not be known. In our approach to redundancy 
management, we have relied upon performance compatibility among byte slices in the same 
redundant module to maintain proper synchronization, and we have also made assumptions on gate 
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delays and path delays in control module designs. For each redundant module, however, the only 
timing characteristics which must be made available to fts environment for its successful deployment 
are phase differences among packets and acknowledgments in the same output batch generated by 
fault-free slices in that module. In synchronous systems, upper bounds on execution times for bask 
hardware operations in each module must also be calculated. Our procedure for integrating 
redundant packet communication modules into fault-tolerant packet communication systems is thus 
somewhat more complex than that for self-timed systems, and yet considerably smrolef than that for 
synchronous systems. We have lost some, but not all, of the desirable modularity properties of 
self-timed systems in our fault tolerance methodology. 

We have shown how to construct redundant packet communication systems so mat hardware 
failures limited to some maximum number of byte shoes in each redundant module can be tolerated. 
We have, however, not addressed the issue of evaluating the reliability improvements attainable 
through such enhancements. We have developed a des^ memodok>gj fw incorpor^mg redmidant 
hardware into packet communication systems for fault tolerance, a set of mechanisms to support mis 
methodology, aid illustrated its application « a specific system design. Alternative approaches to 
redundant management, alternative ntechantsms to support mese approaches, and alternative system 
designs can be developed to guide me construction of fault- to i craut paraBef p roc essing systems based 
on self-timed principles. When detailed logic designs and hardware failure rates are avahnbfe, 
alternative schemes can be evaluated more carefully to compare their reliability properties and 
cost-effectiveness. In mis thesis, we have identified the technical issues and laid me foundation on 
which such alternatives can be developed and examined. 
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7.2 Suggestions for Further Research 

Wc have presented a design methodology for constructing fault-tolerant packet communication 
systems and a system design for a fault-tolerant data flow processor. Let us first consider some 
alternative solutions to redundancy management and packet network design in this framework, and 
then discuss how our concepts and techniques can be applied to achieve fault-tolerance in more 
general forms of data flow processors and in constructing fault-tolerant computers for other 
applications. 

Redundancy Management 

Our approach to redundancy management is based on maintaining timing synchronization and 
consistency in redundant systems. An alternative approach to maintaining timing synchronization is 
described briefly in Section 5.5. This approach is based on the idea of buffering packets and 
acknowledgments in control modules until they can be forwarded. It is conceptually simpler but 
implementation details must be worked out before its hardware requirements and fault tolerance 
capabilities under different fault models can be evaluated. Methods for determining buffer sizes in 
control modules and strategies for dealing with buffer overflow must aho be established. 

For our timing synchronization algorithm. 3f+ 1 synchronizers arc needed to tolerate failures in 
up to f synchronizers among them. The question of whether there exists robust timing 
synchronization algorithms which can be implemented with fewer synchronizers, or whether 3f+l is 
a lower bound for solving this problem remains unsettled. We note that under restricted fault 
assumptions, timing synchronization can be maintained using synchronization voting f 17], which does 
not require any synchronization among byte slices in a redundant module. 
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Wc have also suggested an approach to implement die timing synchronization algorithm in 
control modules under the random wave train fault model. In this approach we propose to use 
lowpass niters to eliminate random wave trains of bounded wavelength. This approach should be 
investigated using analog circuit design and analysis techniques to establish methods for dealing with 
random wave train faults. 

Packet Network Design 

Hardware failures in packet networks can be masked by implementing a fault masking 
capability, based on cither error correcting codes or replication, in every router and allocator. We 
have adopted an alternative approach under which some hardware failures arc masked and others are 
only detected. Other alternatives include relying entirely on packet encoding techniques, such as 
adding a checksum nibble 10 each packet, to detect failures, and using time-out mechanisms to 
control retransmission, as in the ARPAneL It is worthwhile to conduct a more systematic study of 
fault-tolerant network design, classifying different techniques according to the implementation 
technologies and failure modes for which they are effective. 

We have also described two strategies for incoreocatiag redundant hardware into a rectangular 
packet network to support rapid repair. Interconnection networks in large parallel processing systems 
often have regular structures and arc constructed out of a large number of identical units. There are 
thus many opportunities for exploiting structural regularity and uniformity in these networks in 
incorporating redundant modules and paths into men. 

Redundant Data Flow Computer Systems 

Wc have presented the design of a fault-tolerant data flow processor intended for numerical 
computation applications. This processor supports only a subset of die data flow language presented 
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in [22]. More general forms of data flow computer systems [1], [65] must be developed to support 
environments suitable for data base applications and general purpose computing. These systems 
should also be fault-tolerant to improve availability and maintainability. 

Redundant Microcomputer Systems 

Many real-time control applications can benefit from the availability of low-cost fault-tolerant 
computer systems. A redundant system can be constructed using several independently clocked 
microcomputers, and maintaining timing synchronization and consistency among them. 
Redundancy management can be implemented in hardware, by incorporating the suitable hardware 
mechanisms in the microcomputer chip, or in software, by incorporating the corresponding 
operations into application programs. It is a challenging research problem to design a fault-tolerant 
microcomputer system, and the corresponding programming methodology, for real-time control 
applications based on ttJese ideas. 

Finally, we note mat many engineering decisions must be made in applying the methodology 
explained in this diesis to the construction of practical fault-tolerant systems. Such decisions concern 
system modularization for redundancy incorporation, the choice of a suitable fault model for the 
implementation technology, and the choice of redundancy techniques subjected to reliability, 
performance and cost requirements. Methodologies can also be developed for making such 
decisions. 
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